I have my data warehouse built on Amazon Redshift. The problem I am currently facing is, I have a huge fact table (with about 500M rows) in my schema with data for about 10 clients. I have processes that periodically (mostly daily) generates data for this fact table and requires a refresh, meaning - delete old data and insert the newly generated data.
The problem is, this bulk delete-insert operation leave holes in my fact table with a need to VACUUM which is time consuming and hence can't be done immediately. And this fact table (with huge holes due to deleted data) dramatically impacts the snapshot time which consumes data from the fact and dimension tables and refreshes it in the downstream presentation area. How can I optimize such bulk data refresh in a DWH environment?
I believe this should be a well known problem in DWH with some recommended ways to solve. Can anyone please point out the recommended solution?
P.S: One solution can be to create table per client and have a view on top of it which does a union of all the underlying tables. In this case, if I break the fact table per client it is pretty small and can be vacuumed quickly after delete-insert, but looking for solutions with better maintainability.
You might try to play with different types of Vacuum, there is "VACUUM DELETE ONLY", which will reclaim the space, but won't resort lines, not sure if its applicable for your use case.
More info here: http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html
Or I used deep copy approach when I was fighting with vacuuming tables with too many columns. Problem with this could be that you will need a lot of space for intermediate step.
More info here:
http://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
This is a follow up to a previous question of mine after definitely deciding on partition switching as the best way to quickly get data into a heavily indexed fact type table that needs to remain available to readers.
While it seems to be the best way, it is not quite good enough to really satisfy the requirement to allow several (< 5) users to bulk insert at the same time, have the new data indexed and to appear in the indexed views (not necessarily real indexed views, just selects that rely on indices).
The idea of partitioning was that each partition and the index subtree rooted at the partition could, in parallel, be locked as read-only, copied into a working table, new data inserted/updated and the indexes rebuilt then switched back into the main table so readers aren't affected.
The problem is the single working table. Each parallel bulk insert needs its own copy, with the same constraints as the main table to allow switching.
So far I've hit several walls trying to get around this bottleneck:
I tried partitioning the working table using the same partition
function. This doesn't work because you can't disable the indexes on
a partition basis to insert into one while rebuilding the index on
another.
Creating a temporary table as the working table. This
doesn't work because, while you can use the same index names, you
can't easily dynamically create the constraints and can't switch
that in anyway.
Have a fixed set of named working tables? How can I select one and work with it under an alias so I have just one stored proc?
Dynamic SQL? I've tried very hard to avoid going that route. It's complicated as it is.
Big challenge but has anyone got any ideas before I accept the bottleneck? Would Sql 2012 help? How do proper data warehouses cope with this?
How do proper data warehouses cope with this? Compromise and set realistic goals for the EDW. The data warehouse can't be everything to everyone. Make sure that what you're implementing is the best solution for the business (not just the techies/analysts). Are your goals realistic if you cannot find solutions from experienced peers and experts?
Associate a cost with all of the hoops you jump through. Does the data really need to be up to the minute? What if I told you that we needed to spend another $200,000 on storage because we're constantly duplicating partitions and rebuilding indexes and the current solution can't keep up with the IOPS demand? At some point, they're going to figure out that it's not free. While you don't need to just say no, you do need to be realistic and up-front about the cost associated. Additionally, your storage admin will thank you.
As for 2012, there is a new columnstore index which can reduce or replace all of the current nonclustereds you're using to cover all you're analysts search requests. It's highly compressed, covers a very wide variety of search arguments, and utilizes the new Batch execution mode. It performs best on low selectivity queries like the ones frequently performed on fact tables. The one catch is that you can't directly do updates. You'll have to switch the partition out to a staging table, drop the columnstore on the staging table, update the staging table, add the columnstore back, then switch the partition back into the fact table. It sounds like alot, but could be significantly faster and require less IO than maintaining all of those nonclustereds.
My question has always been "Is it really a fact table if it is constantly changing?". This is not OLTP is it? Try offsetting transactions or at least push all updates to a scheduled off-peak time. Updating fact tables is becoming a thing of the past. All of the big boys are moving toward the "Update frowned upon" column oriented architecture for data warehousing. PowerPivot and the Analysis Services Tabular Model are built on the columnstore technology.
Finally, Review Kimballs' DW Toolkit books. He has several that lay out best practices and cover edge-case scenarios. What I learned from them was that Data Warehouse Development is not just Database Development on steroids. It also involves politics and focusing resources on what's best for the business.
My company's workflow relies on two MSSQL databases: one for web content data and the other is the ERP. I've been doing some proof of concept on some tools that would serve as an intermediary that builds a relationship between the datasets, and thus far its proving to be monumentally faster.
Instead of reading out to both datasets, I'd much rather house a database on the local Linux box that represents the data I'm working with. That way, its less pressure on the system as a whole.
What I don't understand is if there is a way to update this new database without completely dropping the table each time or running through a punishing line by line check. If the records had timestamps, this would be easy...but they don't.
Does anyone have any tips? Am I missing some crucial feature I don't know about, or am I
SOL?
Finally, is there one preferred database stack out there anyone thinks might work better than another? I'm not committed to any technology at this point.
Thanks!
Have you read about the MERGE statement in SQL? It allows update or inserts on existing tables.
I assume your tables have primary keys even though you say there is no timestamp.
What are some strategies that people have had success with for maintaining a change history for data in a fairly complex database. One of the applications that I frequently use and develop for could really benefit from a more comprehensive way of tracking how records have changed over time. For instance, right now records can have a number of timestamp and modified user fields, but we currently don't have a scheme for logging multiple change, for instance if an operation is rolled back. In a perfect world, it would be possible to reconstruct the record as it was after each save, etc.
Some info on the DB:
Needs to have the capacity to grow by thousands of records per week
50-60 Tables
Main revisioned tables may have several million records each
Reasonable amount of foreign keys and indexes set
Using PostgreSQL 8.x
One strategy you could use is MVCC, Multi-Value Concurrency Control. In this scheme, you never do updates to any of your tables, you just do inserts, maintaining version numbers for each record. This has the advantage of providing an exact snapshot from any point in time, and it also completely sidesteps the update lock problems that plague many databases.
But it makes for a huge database, and selects all require an extra clause to select the current version of a record.
If you are using Hibernate, take a look at JBoss Envers. From the project homepage:
The Envers project aims to enable easy versioning of persistent JPA classes. All that you have to do is annotate your persistent class or some of its properties, that you want to version, with #Versioned. For each versioned entity, a table will be created, which will hold the history of changes made to the entity. You can then retrieve and query historical data without much effort.
This is somewhat similar to Eric's approach, but probably much less effort. Don't know, what language/technology you use to access the database, though.
In the past I have used triggers to construct db update/insert/delete logging.
You could insert a record each time one of the above actions is done on a specific table into a logging table that keeps track of the action, what db user did it, timestamp, table it was performed on, and previous value.
There is probably a better answer though as this would require you to cache the value before the actual delete or update was performed I think. But you could use this to do rollbacks.
The only problem with using Triggers is that it adds to performance overhead of any insert/update/delete. For higher scalability and performance, you would like to keep the database transaction to a minimum. Auditing via triggers increase the time required to do the transaction and depending on the volume may cause performance issues.
another way is to explore if the database provides any way of mining the "Redo" logs as is the case in Oracle. Redo logs is what the database uses to recreate the data in case it fails and has to recover.
Similar to a trigger (or even with) you can have every transaction fire a logging event asynchronously and have another process (or just thread) actually handle the logging. There would be many ways to implement this depending upon your application. I suggest having the application fire the event so that it does not cause unnecessary load on your first transaction (which sometimes leads to locks from cascading audit logs).
In addition, you may be able to improve performance to the primary database by keeping the audit database in a separate location.
I use SQL Server, not PostgreSQL, so I'm not sure if this will work for you or not, but Pop Rivett had a great article on creating an audit trail here:
Pop rivett's SQL Server FAQ No.5: Pop on the Audit Trail
Build an audit table, then create a trigger for each table you want to audit.
Hint: use Codesmith to build your triggers.
In the past I've never been a fan of using triggers on database tables. To me they always represented some "magic" that was going to happen on the database side, far far away from the control of my application code. I also wanted to limit the amount of work the DB had to do, as it's generally a shared resource and I always assumed triggers could get to be expensive in high load scenarios.
That said, I have found a couple of instances where triggers have made sense to use (at least in my opinion they made sense). Recently though, I found myself in a situation where I sometimes might need to "bypass" the trigger. I felt really guilty about having to look for ways to do this, and I still think that a better database design would alleviate the need for this bypassing. Unfortunately this DB is used by mulitple applications, some of which are maintained by a very uncooperative development team who would scream about schema changes, so I was stuck.
What's the general consesus out there about triggers? Love em? Hate em? Think they serve a purpose in some scenarios?
Do think that having a need to bypass a trigger means that you're "doing it wrong"?
Triggers are generally used incorrectly, introduce bugs and therefore should be avoided. Never design a trigger to do integrity constraint checking that crosses rows in a table (e.g "the average salary by dept cannot exceed X).
Tom Kyte, VP of Oracle has indicated that he would prefer to remove triggers as a feature of the Oracle database because of their frequent role in bugs. He knows it is just a dream, and triggers are here to stay, but if he could he would remove triggers from Oracle, he would (along with the WHEN OTHERS clause and autonomous transactions).
Can triggers be used correctly? Absolutely.
The problem is - they are not used correctly in so
many cases that I'd be willing to give
up any perceived benefit just to get
rid of the abuses (and bugs) caused by
them. - Tom Kyte
Think of a database as a great big object - after each call to it, it ought to be in a logically consistent state.
Databases expose themselves via tables, and keeping tables and rows consistent can be done with triggers. Another way to keep them consistent is to disallow direct access to the tables, and only allowing it through stored procedures and views.
The downside of triggers is that any action can invoke them; this is also a strength - no-one is going to screw up the integrity of the system through incompetence.
As a counterpoint, allowing access to a database only through stored procedures and views still allows the backdoor access of permissions. Users with sufficient permissions are trusted not to break database integrity, all others use stored procedures.
As to reducing the amount of work: databases are stunningly efficient when they don't have to deal with the outside world; you'd be really surprised how much even process switching hurts performance. That's another upside of stored procedures: rather than a dozen calls to the database (and all the associated round trips), there's one.
Bunching stuff up in a single stored proc is fine, but what happens when something goes wrong? Say you have 5 steps and the first step fails, what happens to the other steps? You need to add a whole bunch of logic in there to cater for that situation. Once you start doing that you lose the benefits of the stored procedure in that scenario.
Business logic has to go somewhere, and there's a lot of implied domain rules embedded in the design of a database - relations, constraints and so on are an attempt to codify business rules by saying, for example, a user can only have one password. Given you've started shoving business rules onto the database server by having these relations and so on, where do you draw the line? When does the database give up responsibility for the integrity of the data, and start trusting the calling apps and database users to get it right? Stored procedures with these rules embedded in them can push a lot of political power into the hands of the DBAs. It comes down to how many tiers are going to exist in your n-tier architecture; if there's a presentation, business and data layer, where does the separation between business and data lie? What value-add does the business layer add? Will you run the business layer on the database server as stored procedures?
Yes, I think that having to bypass a trigger means that you're "doing it wrong"; in this case a trigger isn't for you.
I work with web and winforms apps in c# and I HATE triggers with a passion. I have never come across a situation where I could justify using a trigger over moving that logic into the business layer of the application and replicating the trigger logic there.
I don't do any DTS type work or anything like that, so there might be some use cases for using trigger there, but if anyone in any of my teams says that they might want to use a trigger they better have prepared their arguments well because I refuse to stand by and let triggers be added to any database I'm working on.
Some reasons why I don't like triggers:
They move logic into the database. Once you start doing that, you're asking for a world of pain because you lose your debugging, your compile time safety, your logic flow. It's all downhill.
The logic they implement is not easily visible to anyone.
Not all database engines support triggers so your solution creates dependencies on database engines
I'm sure I could think of more reasons off the top of my head but those alone are enough for me not to use triggers.
"Never design a trigger to do integrity constraint checking that crosses rows in a table" -- I can't agree. The question is tagged 'SQL Server' and CHECK constraints' clauses in SQL Server cannot contain a subquery; worse, the implementation seems to have a 'hard coded' assumption that a CHECK will involve only a single row so using a function is not reliable. So if I need a constraint which does legitimately involve more than one row -- and a good example here is the sequenced primary key in a classic 'valid time' temporal table where I need to prevent overlapping periods for the same entity -- how can I do that without a trigger? Remember this is a primary key, something to ensure I have data integrity, so enforcing it anywhere other than the DBMS is out of the question. Until CHECK constraints get subqueries, I don't see an alternative to using triggers for certain kinds of integrity constraints.
Triggers can be very helpful. They can also be very dangerous. I think they're fine for house cleaning tasks like populating audit data (created by, modified date, etc) and in some databases can be used for referential integrity.
But I'm not a big fan of putting lots of business logic into them. This can make support problematic because:
it's an extra layer of code to research
sometimes, as the OP learned, when you need to do a data fix the trigger might be doing things with the assumption that the data change is always via an application directive and not from a developer or DBA fixing a problem, or even from a different app
As for having to bypass a trigger to do something, it could mean you are doing something wrong, or it could mean that the trigger is doing something wrong.
The general rule I like to use with triggers is to keep them light, fast, simple, and as non-invasive as possible.
I find myself bypassing triggers when doing bulk data imports. I think it's justified in such circumstances.
If you end up bypassing the triggers very often though, you probably need to take another look at what you put them there for in the first place.
In general, I'd vote for "they serve a purpose in some scenarios". I'm always nervous about performance implications.
I'm not a fan, personally. I'll use them, but only when I uncover a bottleneck in the code that can be cleared by moving actions into a trigger. Generally, I prefer simplicity and one way to keep things simple is to keep logic in one place - the application. I've also worked on jobs where access is very compartmentalized. In those environments, the more code I pack into triggers the more people I have to engage for even the simplest fixes.
I first used triggers a couple of weeks ago. We changed over a production server from SQL 2000 to SQL 2005 and we found that the drivers were behaving differently with NText fields (storing a large XML document), dropping off the last byte. I used a trigger as a temporary fix to add an extra dummy byte (a space) to the end of the data, solving our problem until a proper solution could be rolled out.
Other than this special, temporary case, I would say that I would avoid them since they do hide what is going on, and the function they provide should be handled explictly by the developer rather then as some hidden magic.
Honestly the only time I use triggers to simulate a unique index that is allowed to have NULL that don't count for the uniqueness.
As to reducing the amount of work: databases are stunningly efficient when they don't have to deal with the outside world; you'd be really surprised how much even process switching hurts performance. That's another upside of stored procedures: rather than a dozen calls to the database (and all the associated round trips), there's one.
this is a little off topic, but you should also be aware that you're only looking at this from one potential positive.
Bunching stuff up in a single stored proc is fine, but what happens when something goes wrong? Say you have 5 steps and the first step fails, what happens to the other steps? You need to add a whole bunch of logic in there to cater for that situation. Once you start doing that you lose the benefits of the stored procedure in that scenario.
Total fan,
but really have to use it sparingly when,
Need to maintain consistency (especially when dimension tables are used in a warehouse and we need to relate the data in the fact table with their proper dimension . Sometime, the proper row in the dimension table can be very expensive to compute so you want the key to be written straight to the fact table, one good way to maintain that "relation" is with trigger.
Need to log changes (in a audit table for instance, it's useful to know what ##user did the change and when it occurred)
Some RDBMS like sql server 2005 also provide you with triggers on CREATE/ALTER/DROP statements (so you can know who created what table, when, dropped what column, when, etc..)
Honestly, using triggers in those 3 scenarios, I don't see why would you ever need to "disable" them.
The general rule of thumb is: do not use triggers. As mentioned before, they add overhead and complexity that can easily be avoided by moving logic out of the DB layer.
Also, in MS SQL Server, triggers are fired once per sql command, and not per row. For example, the following sql statement will execute the trigger only once.
UPDATE tblUsers
SET Age = 11
WHERE State = 'NY'
Many people, including myself, were under the impression that the triggers are fired on every row, but this isn't the case. If you have a sql statement like the one above that may change data in more than one row, you might want to include a cursor to update all records affected by the trigger. You can see how this can get convoluted very quickly.