We are trying to implement table partitioning for a Data Warehouse Fact table which contains approximately 400M rows. Our ETL takes data from source system 50 days backwards (new rows, modified rows, based on source system timestamp) from the previous load. So in every ETL cycle there are new rows coming in, and also old rows which are updating the corresponding rows in the Fact table. The idea is to insert new rows into the Fact table and update modified rows.
The partition column would be date (int, YYYYMMDD) and we are considering to partition by month.
As far as I'm concerned, table partitioning would ease our inserts via fast partition switch operations. We could split the most recent partition to create a new free partition, load new rows into a staging table (using date constraint, e.g for the most recent month) and then use partition switch operation to "move" new rows into the partitioned Fact table. But how can we handle the modified rows which should update the corresponding rows in the Fact table? Those rows can contain data from the previous month(s). Does partition switch help here? Usually INSERT and UPDATE rows are determined by an ETL tool (e.g. SSIS in our case) or by MERGE statement. How partitioning works in these kind of situations?
I'd take another look at the design and try to figure out if there's a way around the updates. Here are a few implications of updating the fact table:
Performance: Updates are fully logged transactions. Big fact tables also have lots of data to read and write.
Cubes: Updating the fact table requires reprocessing the affected partitions. As your fact table continues to grow, the cube processing time will continue to as well.
Budget: Fast storage is expensive. Updating big fact tables will require lots of fast reads and writes.
Purist theory: You should not change the fact table unless the initial value was an error (ie the user entered $15,000 instead of $1,500). Any non-error scenario will be changing the originally recorded transaction.
What is changing? Are the changing pieces really attributes of a dimension? If so, can they be moved to a dimension and have changes handled with a Slowly Changing Dimension type task?
Another possibility, can this be accomplished via offsetting transactions? Example:
The initial InvoiceAmount was $10.00. Accounting later added $1.25 for tax then billed the customer for $11.25. Rather than updating the value to $11.25, insert a record for $1.25. The sum amount for the invoice will still be $11.25 and you can do a minimally logged insert rather than a fully logged update to accomplish.
Not only is updating the fact table a bad idea in theory, it gets very expensive and non-scalable as the fact table grows. You'll be reading and writing more data, requiring more IOPS from the storage subsytem. When you get ready to do analytics, cube processing will then throw in more problems.
You'll also have to constantly justify to management why you need so many IOPS for the data warehouse. Is there business value/justification in needing all of those IOPS for your constant changing "fact" table?
If you can't find a way around updates on the fact table, at least establish a cut-off point where the data is determined read-only. Otherwise, you'll never be able to scale.
Switching does not help here.
Maybe you can execute updates concurrently using multiple threads on distinct ranges of rows. That might speed it up. Be careful not to trigger lock escalation so you get good concurrency.
Also make sure that you update the rows mostly in ascending sort order of the clustered index. This helps with disk IO (this technique might not work well with multi-threading).
There are as many reasons to update a fact record as there are non-identifying attributes in the fact. Unless you plan on a "first delete" then "insert", you simply cannot avoid updates. You cannot simply say "record the metric deltas as new facts".
Related
I am fetching data from a table having 20K row from a third-party source where the way of filling the table can't be changed table.
On the third party, the table is filled as following
New data is coming at every 15 seconds approx 7K rows.
At any given time only the last three timestamps will be available rest data will be deleted.
No index on the table is there. Neither it can be requested due to unavoidable reasons and might be slowness in the insert.
I am aware of the following
Row locks and up the hierarchy other locks are being taken while data insert.
The problem persists with select with NO LOCK.
There is no Join with any other table while fetching as we are joining the tables when data is at the local with us in the temp table.
When the data insertion is stopped at the third party the data comes in 100ms to 122ms.
When service is on it takes 3 to 5 seconds.
Any help/suggestion/approach is appreciated in advance.
The following is a fairly high-end solution. Based on what you have said I believe it would work, but there'd be a lot of detail to work out.
Briefly: table partitions.
Set up a partition scheme on this table
Based on an article I read recently, this CAN be done with unindexed heaps
Data is loaded every 15 seconds? Then the partitions need to be based on those 15 second intervals
For a given dataload (i.e. once per 15 seconds):
Create the "next" partition
Load the data
SWITCH the new partition (new data) into the main table
SWITCH the oldest partition out (data for only three time periods present at a time, right?
Drop that "retired" partition
While potentially efficient and effective, this would be very messy. The big problem I see is, if they can't add a simple index, I don't see how they could possibly set up table partitioning.
Another similar trick is to set up partitioned views, which essentially is "roll your own partitioning". This would go something like:
Have a set of identically structured tables
Create a view UNION ALLing the tables
On dataload, create a new table, load data into that table, then ALTER VIEW to include that newest table and remove the oldest table.
This could have worse locking/blocking issues than the partitioning solution, though much depends on how heavy your read activity is. And, of course, it is much messier than just adding an index.
We have a few tables that are periodically recomputed within SQL Server. The computation takes a few seconds to a few minutes and we do the following:
Dump the results in computed_table_tmp
Drop computed_table
Rename computed_table_tmp to computed_table. (and all indexes).
However, we seem to still run into concurrency issues where we have our application requesting a view that utilizes this computed table at the precise moment where it no longer exists.
What would be the best technique to avoid this type of problem while ensuring high availability?
If this table is part of your high-availability requirement, then you can't do this the way you've been doing it. Dropping a table in a production SQL environment breaks the concept of high availability.
You might be able to accomplish what you're trying to achieve by creating one or more partitions on this table. A partitioned table is divided into subgroups of rows that can be spread across more than one filegroup in your database. For querying purposes, however, the table is still a single logical entity. The advantage of using a table partition is that you can move around subsets of your data without breaking the integrity of the database, i.e., high-availability is still in place.
In your scenario, you'd have to modify your process such that all activity takes place in the production version of the table. The new rows are dumped in to a separate partition, based on the value of your partition function. Then you'll need to switch the partitions.
One of the things you'll need to do is identify a column in your table that may be used as the partition column, which determines which partition a row will be allocated to. This might be, for example, a datetime column indicating when the row was generated. You can even use a computed column for this purpose, provided it is a PERSISTED column.
One caveat: Table partitioning is not available in all editions of SQL Server... I don't believe Standard has it.
I am developing a Job application which executes multiple parallel jobs. Every job will pull data from third party source and process. Minimum records are 100,000. So i am creating new table for each job (like Job123. 123 is jobId) and processing it. When job starts it will clear old records and get new records and process. Now the problem is I have 1000 jobs and the DB has 1000 tables. The DB size is drastically increased due to lots of tables.
My question is whether it is ok to create new table for each job. or have only one table called Job and have column jobId, then enter data and process it. Only problem is every job will have 100,000+ records. If we have only one table, whether DB performance will be affected?
Please let me know which approach is better.
Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine, that's what databases are for. But...I suspect that you don't need 100 million persistent records, do you? It looks like you only process one Job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing. But even after the delete it will take up as much space as 100k records.
I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.
We have 2 tables. One holds measurements, the other one holds timestamps (one for every minute)
every measurement holds a FK to a timestamp.
We have 8M (million) measurements, and 2M timestamps.
We are creating a report database via replication, and my first solution was this: when a new measurement was received via the replication process, lookup the right timestamp and add it to the measurement table.
Yes, it's duplication of data, but it is for reporting and since we have measurements every 5 minutes and users can query for yearly data (105.000 measurements) we have to optimize for speed.
But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables), SqlServer is so fast, you don't see the difference.
My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.
What is your first feeling on this?
EDIT:
new measurements: 400 records per 5 minutes
EDIT 2:
maybe the question is not so clear:
the first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted.
In that case we have an action when the record is inserted AND an extra (duplicated) timestamp value. In this case we lonly query ONE table because it holds all the data.
The second solution is to join the two tables in a query.
With the proper index the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the joins might actually be faster because there is literally 6 million fewer timestamps that it has to read from the disk.
*This is just a guess based on my experience with tables with millions of records. You results will vary based on your queries.
I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.
If the query just retrieves the data for the given date ranges, there will be a merge join - that is, a range scan for each of tow tables. Since the timestamp table presumably contains only timestamp, this shouldn't be expensive.
On the other hand, if you have only one table and index on the date column, the index itself becomes larger and more expensive to scan.
So, with properly constructed indexes and queries I won't expect a significant difference in performance.
I'd suggest you to keep properly normalized design until you start having performance problems that force you to change it. And then you need to carefully analyze query plans and measure performance with different options - there're lots of thing that could matter in your particular case.
Frankly in this case your best bet is try both solutions and see which one is better. Performance tuning is an art when you start talking about large data sets and is highly dependant onthe not only the database design you have but the hardware and the whther you are using partioning, etc. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical and tthe index you would need on on the datetime field is critical to select performance, so you really need to thouroughly test this. Don't forget about dumping the cache when you test. And test multiple times and if possible test under a typical query load.
I have a table with more than a millon rows. This table is used to index tiff images. Each image has fields like date, number, etc. I have users that index these images in batches of 500. I need to know if it is better to first insert 500 rows and then perform 500 updates or, when the user finishes indexing, to do the 500 inserts with all the data. A very important thing is that if I do the 500 inserts at first, this time is free for me because I can do it the night before.
So the question is: is it better to do inserts or inserts and updates, and why? I have defined a id value for each image, and I also have other indices on the fields.
Updates in Sql server result in ghosted rows - i.e. Sql crosses one row out and puts a new one in. The crossed out row is deleted later.
Both inserts and updates can cause page-splits in this way, they both effectively 'add' data, it's just that updates flag the old stuff out first.
On top of this updates need to look up the row first, which for lots of data can take longer than the update.
Inserts will just about always be quicker, especially if they are either in order or if the underlying table doesn't have a clustered index.
When inserting larger amounts of data into a table look at the current indexes - they can take a while to change and build. Adding values in the middle of an index is always slower.
You can think of it like appending to an address book: Mr Z can just be added to the last page, while you'll have to find space in the middle for Mr M.
Doing the inserts first and then the updates does seem to be a better idea for several reasons. You will be inserting at a time of low transaction volume. Since inserts have more data, this is a better time to do it.
Since you are using an id value (which is presumably indexed) for updates, the overhead of updates will be very low. You would also have less data during your updates.
You could also turn off transactions at the batch (500 inserts/updates) level and use it for each individual record, thus reducing some overhead.
Finally, test this out to see the actual performance on your server before making a final decision.
This isn't a cut and dry question. Krishna's and Galegian's points are spot on.
For updates, the impact will be lessened if the updates are affecting fixed-length fields. If updating varchar or blob fields, you may add a cost of page splits during update when the new value surpasses the length of the old value.
I think inserts will run faster. They do not require a lookup (when you do an update you are basically doing the equivalent of a select with the where clause). And also, an insert won't lock the rows the way an update will, so it won't interfere with any selects that are happening against the table at the same time.
The execution plan for each query will tell you which one should be more expensive. The real limiting factor will be the writes to disk, so you may need to run some tests while running perfmon to see which query causes more writes and causes the disk queue to get the longest (longer is bad).
I'm not a database guy, but I imagine doing the inserts in one shot would be faster because the updates require a lookup whereas the inserts do not.