Snowflake micro-partitions in case of inserts

How do Snowflake micro-partitions work if we insert data into a table one row at a time versus loading a whole file?
Will the number of micro-partitions increase when loading the data one row at a time with INSERTs, compared to loading the data from files?

Not sure what the purpose of the question is, but it is obviously MUCH better to batch load your file rather than do individual record inserts, not just for the reason you're asking about, but also for the sake of raw load performance.
As for your question: if Snowflake has a single-record micro-partition and you insert one additional record, it will create a new two-record micro-partition (assuming the records aren't very large). So the active table won't have more micro-partitions per se, but the original single-record micro-partition will still exist as part of Time Travel and Fail-safe. If you do this one record at a time, over and over again, you end up paying for a lot of extra micro-partitions.
Single-record inserts are just not a good idea in Snowflake. It's worth looking into ways to batch them up and load them in bulk.
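To make that concrete, here is a rough sketch of the two approaches in Snowflake SQL; the table name, stage name, and file path are made up for illustration:
    -- Row by row: each INSERT can produce another tiny micro-partition, and the
    -- superseded ones stick around in Time Travel/Fail-safe.
    INSERT INTO sales_raw (id, amount) VALUES (1, 10.00);
    INSERT INTO sales_raw (id, amount) VALUES (2, 12.50);
    -- Bulk load instead: stage the file once, then COPY builds well-sized
    -- micro-partitions in a single pass.
    PUT file:///tmp/sales.csv @my_stage;   -- run from SnowSQL or another client
    COPY INTO sales_raw
      FROM @my_stage
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);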

Related

MSSQL Creating and loading data

I'm interested to hear other developers' views on creating and loading data, as the current site I'm working on has a completely different take on DWH loading.
The protocol currently used to load a fact table has a number of steps:
1. Drop the old table
2. Recreate the table with no PK/clustered index
3. Load the cleaned/new data
4. Create the PK and indexes
I'm wondering how much work really goes on under the covers in step 4? The data are loaded without a clustered index, so I'm assuming that the natural order of the data load defines its order on disk. When step 4 creates a primary key (clustered), it will re-order the data on disk into that order. Would it not be better to load the data with the PK/clustered index already defined, thereby reducing server workload?
When inserting a large number of records, the overhead of updating the index can often be larger than that of simply creating it from scratch. The performance gain comes from inserting into a heap, which is the most efficient way to get data into a table.
The only way you can know whether your import strategy is faster with the indexes left intact is to test both on your own environment and compare.
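As a rough sketch of the load-then-index pattern described above (all object names and the file path are placeholders):
    -- 1 & 2: drop and recreate the table as a plain heap (no PK, no clustered index).
    IF OBJECT_ID('dbo.FactSales', 'U') IS NOT NULL
        DROP TABLE dbo.FactSales;
    CREATE TABLE dbo.FactSales (SaleId INT NOT NULL, DateKey INT NOT NULL, Amount MONEY NOT NULL);
    -- 3: load the cleaned data onto the heap.
    BULK INSERT dbo.FactSales
    FROM 'C:\loads\fact_sales.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
    -- 4: only now build the clustered PK (and any other indexes).
    ALTER TABLE dbo.FactSales
        ADD CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (SaleId);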
In my view, indexes are good for SELECTs and may be bad for DML operations.
If you are loading a huge amount of data, the indexes need to be updated for every insert. This can drag down performance, sometimes far beyond acceptable limits.

Fact table partitioning: how to handle updates in ETL?

We are trying to implement table partitioning for a Data Warehouse Fact table which contains approximately 400M rows. Our ETL pulls data from the source system going back 50 days from the previous load (new rows and modified rows, based on the source system timestamp). So in every ETL cycle there are new rows coming in, as well as old rows that update the corresponding rows in the Fact table. The idea is to insert new rows into the Fact table and update modified rows.
The partition column would be a date (int, YYYYMMDD) and we are considering partitioning by month.
As far as I can see, table partitioning would ease our inserts via fast partition switch operations. We could split the most recent partition to create a new free partition, load new rows into a staging table (using a date constraint, e.g. for the most recent month) and then use a partition switch operation to "move" the new rows into the partitioned Fact table. But how can we handle the modified rows, which should update the corresponding rows in the Fact table? Those rows can contain data from previous month(s). Does partition switching help here? Usually INSERT and UPDATE rows are determined by an ETL tool (e.g. SSIS in our case) or by a MERGE statement. How does partitioning work in this kind of situation?
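For illustration, the switch-based insert we have in mind would look roughly like this (all object names, columns, and the partition number are invented):
    -- Staging table on the same filegroup as the target partition, with a CHECK
    -- constraint matching that partition's date range (indexes must match the fact table).
    CREATE TABLE dbo.Fact_Stage
    (
        DateKey INT   NOT NULL CHECK (DateKey >= 20240101 AND DateKey < 20240201),
        Amount  MONEY NOT NULL
    );
    INSERT INTO dbo.Fact_Stage (DateKey, Amount)
    SELECT DateKey, Amount
    FROM   etl.NewRows
    WHERE  DateKey >= 20240101 AND DateKey < 20240201;
    -- Metadata-only switch of the staged rows into the (empty) target partition.
    ALTER TABLE dbo.Fact_Stage
        SWITCH TO dbo.FactTable PARTITION 25;   -- 25 = the 2024-01 partition in this made-up scheme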
I'd take another look at the design and try to figure out if there's a way around the updates. Here are a few implications of updating the fact table:
Performance: Updates are fully logged transactions. Big fact tables also have lots of data to read and write.
Cubes: Updating the fact table requires reprocessing the affected partitions. As your fact table continues to grow, the cube processing time will continue to grow as well.
Budget: Fast storage is expensive. Updating big fact tables will require lots of fast reads and writes.
Purist theory: You should not change the fact table unless the initial value was an error (i.e. the user entered $15,000 instead of $1,500). Any non-error scenario will be changing the originally recorded transaction.
What is changing? Are the changing pieces really attributes of a dimension? If so, can they be moved to a dimension and have changes handled with a Slowly Changing Dimension type task?
Another possibility, can this be accomplished via offsetting transactions? Example:
The initial InvoiceAmount was $10.00. Accounting later added $1.25 for tax and then billed the customer for $11.25. Rather than updating the value to $11.25, insert a record for $1.25. The sum amount for the invoice will still be $11.25, and you can do a minimally logged insert rather than a fully logged update to accomplish the same result.
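A minimal sketch of that offsetting-entry idea, with invented table and column names:
    -- Original fact row: invoice 1001 recorded at $10.00.
    INSERT INTO dbo.FactInvoiceLine (InvoiceKey, DateKey, Amount) VALUES (1001, 20240105, 10.00);
    -- Tax added later: insert an adjustment row instead of updating the original.
    INSERT INTO dbo.FactInvoiceLine (InvoiceKey, DateKey, Amount) VALUES (1001, 20240110, 1.25);
    -- The invoice total still comes out to $11.25.
    SELECT InvoiceKey, SUM(Amount) AS InvoiceAmount
    FROM   dbo.FactInvoiceLine
    GROUP BY InvoiceKey;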
Not only is updating the fact table a bad idea in theory, it gets very expensive and non-scalable as the fact table grows. You'll be reading and writing more data, requiring more IOPS from the storage subsystem. When you get ready to do analytics, cube processing will then throw in more problems.
You'll also have to constantly justify to management why you need so many IOPS for the data warehouse. Is there business value/justification in needing all of those IOPS for your constantly changing "fact" table?
If you can't find a way around updates on the fact table, at least establish a cut-off point where the data is determined read-only. Otherwise, you'll never be able to scale.
Partition switching does not help here.
Maybe you can execute updates concurrently using multiple threads on distinct ranges of rows. That might speed it up. Be careful not to trigger lock escalation so you get good concurrency.
Also make sure that you update the rows mostly in ascending sort order of the clustered index. This helps with disk IO (this technique might not work well with multi-threading).
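A rough sketch of that batching idea (names are placeholders; each thread could be handed a different key range):
    DECLARE @from BIGINT = 0, @step BIGINT = 100000, @max BIGINT;
    SELECT @max = MAX(FactKey) FROM dbo.FactTable;
    WHILE @from <= @max
    BEGIN
        -- Each pass updates one contiguous clustered-key range, keeping transactions
        -- small and applying the writes roughly in clustered-index order.
        UPDATE f
        SET    f.Amount = s.Amount
        FROM   dbo.FactTable    AS f
        JOIN   etl.ModifiedRows AS s ON s.FactKey = f.FactKey
        WHERE  f.FactKey >= @from AND f.FactKey < @from + @step;
        SET @from = @from + @step;
    END;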
There are as many reasons to update a fact record as there are non-identifying attributes in the fact. Unless you plan on a "first delete" then "insert", you simply cannot avoid updates. You cannot simply say "record the metric deltas as new facts".

What does "bulk load" mean?

Jumping from article to article, I see the expression "bulk loading" everywhere.
What does it really (technically) mean?
What does it imply?
Explanation based on use-cases is welcome.
Indexes are usually optimized for inserting rows one at a time. When you are adding a great deal of data at once, inserting rows one at a time may be inefficient. For instance, with a B-tree, the optimal way to insert a single key is a very poor way of adding a bunch of data to an empty index.
Instead you pursue a different strategy with B-trees. You presort all of the data and group it into blocks. You can then build a new B-tree by transforming the blocks into tree nodes. Although both techniques have the same asymptotic performance, O(n log n), the bulk-load operation has a much smaller constant factor.
Bulk loading is a way to load data (typically into a database) in 'large chunks'. Where you might enter a customer or a purchase order or information about items in inventory one at a time into your system, bulk loading takes a file of this same sort of information and loads hundreds/thousands/millions of records in a short period of time.
If you convert from one kind of DBMS to another, you would hope not to enter all the information into the new DB from the old DB. Instead, you would dump the information from the old DB to a file in a format that can be easily read by the new DB and then import that data into the new DB.
That's what bulk loading entails (at the 35,000-foot level, anyway).
Bulk loading is used to import/export large amounts of data. Usually bulk operations are not logged and transactional integrity might not work as expected. Bulk operations often bypass triggers and integrity checks such as constraints. This improves performance quite significantly for large amounts of data.
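As a concrete SQL Server example (the table and file path are made up), a single BULK INSERT replaces thousands of individual INSERT statements and can be minimally logged:
    BULK INSERT dbo.Customers
    FROM 'C:\exports\customers.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK,               -- allows minimal logging under the right recovery model
        BATCHSIZE       = 100000
    );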
One thing to remember is that bulk loading implies that the data content in the target matches the source, but this is only true if the source system is quiesced. For any data source, and especially for large data sets, the source data can change after it has been read, while the transfer is still happening. Traditionally, online systems either have to go offline or suspend updates if an exact point-in-time capture that matches the source is required.

choosing table design for database performance

I am developing an application which executes multiple parallel jobs. Every job pulls data from a third-party source and processes it. The minimum is 100,000 records per job. So I am creating a new table for each job (like Job123, where 123 is the job ID) and processing it. When a job starts, it clears the old records, gets the new records, and processes them. Now the problem is that I have 1000 jobs and the DB has 1000 tables. The DB size has increased drastically due to the large number of tables.
My question is whether it is OK to create a new table for each job, or to have only one table called Job with a jobId column, and then insert the data there and process it. The only concern is that every job will have 100,000+ records. If we have only one table, will DB performance be affected?
Please let me know which approach is better.
Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine, that's what databases are for. But...I suspect that you don't need 100 million persistent records, do you? It looks like you only process one Job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing. But even after the delete it will take up as much space as 100k records.
I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.
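A minimal sketch of that single-table layout, with invented names:
    CREATE TABLE dbo.JobRecord
    (
        JobId    INT           NOT NULL,
        RecordId BIGINT        NOT NULL,
        Payload  NVARCHAR(MAX) NULL,
        CONSTRAINT PK_JobRecord PRIMARY KEY CLUSTERED (JobId, RecordId)  -- JobId first: each job's rows cluster together
    );
    -- A job reset only touches that job's key range:
    DELETE FROM dbo.JobRecord WHERE JobId = 123;
    INSERT INTO dbo.JobRecord (JobId, RecordId, Payload)
    SELECT 123, SourceId, SourcePayload
    FROM   staging.Job123Feed;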

Your first gut feeling on this SqlServer design question

We have two tables. One holds measurements, the other holds timestamps (one for every minute).
Every measurement holds an FK to a timestamp.
We have 8M (million) measurements and 2M timestamps.
We are creating a report database via replication, and my first solution was this: when a new measurement is received via the replication process, look up the right timestamp and add it to the measurement table.
Yes, it's duplication of data, but it is for reporting, and since we have measurements every 5 minutes and users can query for yearly data (105,000 measurements), we have to optimize for speed.
But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables), SqlServer is so fast, you don't see the difference.
My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.
What is your first feeling on this?
EDIT:
new measurements: 400 records per 5 minutes
EDIT 2:
maybe the question is not so clear:
the first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted.
In that case we have an action when the record is inserted AND an extra (duplicated) timestamp value. In this case we only query ONE table because it holds all the data.
The second solution is to join the two tables in a query.
With the proper index the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the join might actually be faster, because there are literally 6 million fewer timestamp values that it has to read from disk.
*This is just a guess based on my experience with tables with millions of records. Your results will vary based on your queries.
I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.
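Roughly, such an indexed view might look like this (object names are invented, and the usual indexed-view restrictions such as SCHEMABINDING apply):
    CREATE VIEW dbo.vMeasurementReport
    WITH SCHEMABINDING
    AS
    SELECT m.MeasurementId, m.MeasuredValue, t.MeasuredAt
    FROM   dbo.Measurement  AS m
    JOIN   dbo.TimestampDim AS t ON t.TimestampId = m.TimestampId;
    GO
    -- The unique clustered index materializes the join result on disk.
    CREATE UNIQUE CLUSTERED INDEX IX_vMeasurementReport
        ON dbo.vMeasurementReport (MeasurementId);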
If the query just retrieves the data for the given date ranges, there will be a merge join - that is, a range scan for each of the two tables. Since the timestamp table presumably contains only the timestamp, this shouldn't be expensive.
On the other hand, if you have only one table and index on the date column, the index itself becomes larger and more expensive to scan.
So, with properly constructed indexes and queries I wouldn't expect a significant difference in performance.
I'd suggest you keep the properly normalized design until you start having performance problems that force you to change it. And then you need to carefully analyze query plans and measure performance with different options - there are lots of things that could matter in your particular case.
Frankly, in this case your best bet is to try both solutions and see which one is better. Performance tuning is an art when you start talking about large data sets, and it is highly dependent not only on the database design you have but also on the hardware and on whether you are using partitioning, etc. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical, and the index you would need on the datetime field is critical to select performance, so you really need to test this thoroughly. Don't forget about clearing the cache when you test. And test multiple times and, if possible, test under a typical query load.
