Massive deletes in snowflake fact tables - snowflake-cloud-data-platform

In snowflake. Is there any best practice for erasing/purge old data i.e. running over the edge historically for big fact tables? After all, this is what you traditionally use partitions for in a conventional RDBMS as SQL Server. Truncate a partition in SQL Server takes milliseconds.
Best Regards
Jan

You can use clustering on Snowflake in a similar fashion as partitions in SQL Server. They aren't exactly the same, but if you are deleting old data by date, then you can cluster by that date. This way when you delete older micropartitions, Snowflake won't need to create new ones or search inside existing ones to find the records to delete...it'll simply delete the files that need to be deleted, which is a metadata operation and is fast.
That said, if you are loading data in order of the same date field, then your table may already be fairly well clustered on that date field. If the fact is very large, clustering on that date field may take some time if it isn't already naturally ordered that way, but it has a lot of benefits, including the use-case that you've asked about in this post.

Related

Mirroring legacy FlashFiler2 database to SQL Server

I maintain a very old data acquisition system which uses a legacy FlashFiler2 database. One of my customers would like the database tables to be mirrored into a SQL Server database for easier post-processing.
There are two types of tables:
Measured values: these tables get new timestamped data every day and my strategy would be to record the timestamp of the last data record that I have already mirrored for every measuring point and only add the new ones to the target database.
Mostly static tables: these tables rarely change and the records don't bear a timestamp. Maybe think of a customers table that rarely gets new entries and the existing records are changed very rarely.
To handle case 2 by brute force, I would have to either clear the target table and recreate it every day or compare each and every record for changes, detect deleted records and add new ones.
What is an efficient way to accomplish this task?
In a related post I found the idea to create MD5 hashes of the target records. The hash could then be used to compare records for changes. Of course I would still have to check for added and deleted records. Would this be worth the effort or should I go with one of the brute force methods?
My tools are: Visual Studio 2017, C# with the ADO.NET provider for SQL Server and FlashFiler2 Delphi components.
I ended up with the following:
1) for all tables that contain rarely changing data I chose to recreate them completely. The performance is indeed better than I had thought.
2) for the measured values tables I chose a configurable lookback from which on the data are chronologically copied to the target database. This is mostly redundant because the lookback by intention should include data that have already been transferred. It is necessary, however, because at times it might be necessary to correct some data in the source that have already been transferred. In my case those corrections usually occur within one or two months. So a lookback of 3 months would usually be enough.
The main performance bottleneck now seems to be my record by record insertion into the target database. I guess there are batch routines for this but that would be a topic for another post.

Can I use Hadoop to speed up a slow SQL stored procedure?

The problem:
I have 2 SQL Server databases from 2 different applications. They describe different aspects of industrial machines: one is about "how many consumables were spent per order", the other is about "how many good/bad production items were produced per operator". Sometimes many operators are working on 1 order one after another, sometimes one operator is working on multiple small orders, and there is no connection Order-Operator in the database.
I want to have united fact table, where for every timestamp I know MachineID, OrderID and OperatorID. If a timestamp exists in DB1, then the record will have numeric measures from it (Consumables); if it exists in DB2, then it will have numeric measures from DB2 (good/bad production items). If it exists in both databases, then it have all numeric measures. A simple UNION ALL is not enough, because I want to have MachineID, OrderID and OperatorID for every record.
I created a T-SQL stored procedure to make FULL JOIN by timestamp and MachineID. But on large data sets (multiple machines, multiple customers) it becomes very slow. Both applications support editing history, so I need to merge full history from both databases at every nightly load.
To speed up the process, I would like to put calculations into multiple parallel threads, separated by Customer, MachineID, and Year.
I tried to do it by using SQL Server stored procedures, running in parallel by SQL Agent with different parameters, but I found that it didn't help the performance. Instead it created multiple deadlocks when updating staging and final tables.
I am looking for an alternative way to resolve this problem, but I don't know what is the right tool. Can Hadoop or similar parallel processing tool help with this task?
I am looking for solution with minimal cost, because it is needed for just one specific task. For everything else, SQL Server and PowerBI reporting are working just fine for me.
Hadoop seems hard to justify in this use case, given limited scope. The thing about Hadoop is that it scales well not only due to parallel processing but thanks to parallel IO, when data is distributed across multiple servers/storage media. Unless you happy to copy all data to HDFS distributed among multiple nodes, it likely will not help much. If you want to spin up a Hadoop cluster and run multiple jobs querying single SQL server, it'll likely end up badly for the later.
Have you considered optimizations which will allow you to limit the amount of data you processing nightly?
E.g. what is 'timestamp' field? Does it reflect last update time? Can you use it to filter rows which haven't been updated since the previous run?
Even if the 'timestamp' is not the time of last updates, can you add an "updateTime" field and triggers on updates which will populate the field, so you don't need to import rows which have not changed since the previous run? If you build an index on the field, then, if the number of updates during the day is not high relative to total table size, a query with a filter on such field will hit the index, and fetching of incremental changes should be fast.
Another thing to consider - are those DBs running on the same node/SQL server? Access to remote DBs is slow, so if that's the case, think about how to fix this first.

Which is the best way to take backup of all table data either merging one or separate in SQL Server?

In SQL Server 2005, I have a requirement to take backup the output table After Monthly Processing, the output table contains 50,000 rows. Which is the best way to take backup of the table for each month either taking backup as one table or monthly wise separate tables. Which method provides to save memory and best performance for managing data in SQL Server 2005?
I don't know if there would be any wrong answer here.. but it depends mostly on how this backup table will be used. What I would do:
Back everything up to the single table, but with another field named RunMonth (or something similar) that will indicate the month that those 50k records belong to. This would make querying the historical data much easier as opposed to having them in separate tables.
I hope that makes sense.
Taking a backup of the entire table is better not because it saves space but because it is much easier to handle.
But again, I would have multiple tables for each months rather handle the same with date/ date time field in a single table.
In my opinion, I would just backup the entire schema frequently so as to minimise any data lose.

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried with the fact that one table in particular will be getting very big. Each customer has approximately 20.000.000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be done on this table. I am already noticing performance issues and users being temporarily locked out.
My question, will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advices below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code) on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently, as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indices on the temp table and run the query. This is not how it should be. Therefore, we are considering to buy Enterprise Edition for use of partitioned tables. If the purchase cannot go through I plan to use a workaround to accomplish partitioning in Standard Edition.
Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
Datawarehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have a different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a datawarehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second response (or indeed sub-minute) for every query.
Partitioning can be useful, especially if you have clear demarcations such as, as in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.
Splitting tables for performance reasons is called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question as 'datawarehouse' as well I assume you know some things about the subject. Depending on your goals you could go for a star-schema (a multidemensional model with a fact and dimensiontables). Store all fastchanging data in 1 table (per subject) and the slowchaning data in another dimension/'snowflake' tables.
An other option is the DataVault method by Dan Lindstedt. Which is a bit more complex but provides you with full flexibility.
http://danlinstedt.com/category/datavault/
In a properly designed database, that is not a huge anmout of records and SQl server should handle with ease.
A partioned single table is usually the best way to go. Trying to maintain separate indivudal customer tables is very costly in termas of time and effort and far more probne to errors.
Also examine you current queries if you are experiencing performance issues. If you don't have proper indexing (did you for instance index the foreign key fields?) queries will be slow, if you don't have sargeable queries they will be slow if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is striclty needed? If you have select * anywhere in your production code, get rid of it and only return the fields you need. If you used views that call views that call views or if you used EAV table, you willhave performance iisues at this level. If you allowed a framework to autogenerate SQl code, you may well have badly perforimng queries. Remember Profiler is your friend. Of course you could also have a hardware issue, you need a pretty good sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you need to hire a professional dba with performance tuning experience. It is quite complex stuff. Databases desigend by application programmers often are bad performers when they get a real number of users and records. Database MUST be designed with data integrity, performance and security in mind. If you didn't do that the changes of having them are slim indeed.
Partioning is definately something to look into. I had a database that had 2 tables sharded. Each table contained around 30-35million records. I have since merged this into one large table and assigned some good indexes. So far, I've not had to partition this table as it's working a treat, but I'm keep partitioning in mind. One thing that I have noticed, compared to when the data was sharded, and that's the data import. It is now slower, but I can live with that as the Import tool can be re-written ;o)
One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries to perform (but partitioning is not usually about performance but about maintenance being able to add and drop partitions quickly) over very large data sets - but it really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.
Keep one table - 20M rows isn't huge, and customers aren't exactly the kind of table that you can easily 'archive off', and the aggrevation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at BTree searching than your own invention is)
You will need to look into the performance and locking issues however - this will prevent your db from scaling.
You can also create supplemental tables that hold already calculated details on historical information if there are common queries.

How reliable is SQL server replication?

We have a database on SQL Server 2000 which should be truncated from time to time. It looks like the easiest solution would be to create a duplicate database and copy the primary database there. Then the primary database may be safely truncated by specially tailored stored procedures.
One way replication would guarantee that the backup database contains all updates from the primary one.
We plan to use backup database for reporting and primary for operative data.
Primary database will be truncated at night once in 2 days.
Database is several gigabytes. Only several tables are quite large (1-2 mln rows)
What are possible pitfalls? How reliable would such a solution be? Will it slow down the primary database?
Update: Variant with DTS for doing copying sounds good but has own disadvantages. It requires quite robust script which would run for about an hour to copy updated rows. There is also issue with integrity constraints in primary database which would make truncating it non-trivial task. Because of this replication cold straighten things up considerably.
It is also possible but not quite good variant to use union VIEW because system which woks mostly in unattended mode whiteout dedicated support personnel. It is related issue but not technical though.
While replication is usually robust, there are times where it can break and require a refresh. Managing and maintaining replication can become complicated. Once the primary database is truncated, you'll have to make sure that action is not replicated. You may also need an improved system of row identification as after you've truncated the primary database tables a couple of times, you'll still have a complete history in your secondary database.
There is a performance hit on the publisher (primary) as extra threads have to run to read the transaction log. Unless you're under heavy load at the moment, you likely won't notice this effect. Transaction log management can become more important also.
Instead, I'd look at a different solution for your problem. For example, before truncating, you can take a backup of the database, and restore it as a new database name. You then have a copy of the database as it was before the truncation, and you can query both at once using three-part names.
You've mentioned that the purpose of the secondary data is to keep report off. In this case you can create a view like SELECT * FROM Primary.dbo.Table UNION ALL SELECT * FROM SecondaryDBJune2008.dbo.Table UNION ALL SELECT * FROM SecondaryDBOctober2008.dbo.Table. You wouild then need to keep this view up to date whenever you perform a truncate.
The other alternative would be to take a snapshot of the current data before truncation and insert it into a single reporting database. Then you'd just have the Primary and the Historical databases - no need to modify views once they're created.
How much data are we talking about in GB?
As you're planning to perform the truncation once every two days, I'd recommend the second alternative, snapshotting the data before truncation into a single Historical database. This can be easily done with a SQL Agent job, without having to worry about replication keeping the two sets of data in synch.
I would not use replication for this. We have a fairly complex replication setup running with 80+ branches replicating a few tables to one central database. When connectivity goes down for a few days, the data management issues are hair raising.
If you want to archive older data, rather use DTS. You can then build the copying and truncation/deletion of data into the same DTS package, setting it so that the deletion only happens if the copy was successful.

Resources