I have my data warehouse built on Amazon Redshift. The problem I am currently facing is that I have a huge fact table (about 500M rows) in my schema, holding data for about 10 clients. I have processes that periodically (mostly daily) generate data for this fact table and require a refresh, meaning: delete the old data and insert the newly generated data.
The problem is that this bulk delete-insert operation leaves holes in my fact table, which then needs a VACUUM; that is time consuming and hence can't be done immediately. And this fact table (with huge holes due to deleted data) dramatically slows down the snapshot process, which reads data from the fact and dimension tables and refreshes the downstream presentation area. How can I optimize such a bulk data refresh in a DWH environment?
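For concreteness, each refresh run does essentially the following (table and column names here are purely illustrative):

    -- Hypothetical per-client refresh: remove the client's old rows,
    -- then load the newly generated batch from a staging table.
    DELETE FROM fact_sales WHERE client_id = 42;
    INSERT INTO fact_sales
    SELECT * FROM staging_fact_sales WHERE client_id = 42;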
I believe this should be a well-known problem in DWH with some recommended ways to solve it. Can anyone please point out the recommended solution?
P.S.: One solution would be to create a table per client and put a view on top of them that unions all the underlying tables. In that case, if I break the fact table up per client, each piece is pretty small and can be vacuumed quickly after the delete-insert, but I'm looking for solutions with better maintainability.
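A rough sketch of that per-client layout, with made-up table names:

    -- One small fact table per client; each can be refreshed and vacuumed quickly.
    CREATE TABLE fact_sales_client_a (LIKE fact_sales);
    CREATE TABLE fact_sales_client_b (LIKE fact_sales);

    -- A view that presents them to downstream queries as a single fact table.
    CREATE VIEW fact_sales_all AS
        SELECT * FROM fact_sales_client_a
        UNION ALL
        SELECT * FROM fact_sales_client_b;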
You might try to play with different types of VACUUM; there is VACUUM DELETE ONLY, which will reclaim the space but won't re-sort rows. I'm not sure if it's applicable to your use case.
More info here: http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html
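For reference, the command itself is simply (table name is illustrative):

    -- Reclaim space from deleted rows without re-sorting; typically much
    -- faster than a full VACUUM on a big fact table.
    VACUUM DELETE ONLY fact_sales;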
Alternatively, I used the deep copy approach when I was fighting with vacuuming tables that had too many columns. The drawback is that you will need a lot of space for the intermediate step.
More info here: http://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
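A minimal sketch of the deep copy, assuming the original table was created with the dist/sort keys you want to keep (table names are illustrative):

    BEGIN;
    -- LIKE copies the column definitions plus dist and sort keys.
    CREATE TABLE fact_sales_copy (LIKE fact_sales);
    -- Reloading writes the rows back in sorted order, leaving no holes.
    INSERT INTO fact_sales_copy SELECT * FROM fact_sales;
    DROP TABLE fact_sales;
    ALTER TABLE fact_sales_copy RENAME TO fact_sales;
    COMMIT;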
Related
I need help selecting the right database for my data.
I have a table of usersItems with the following columns:
userId, itemId, attribute1, attribute2, attribute3, ..., attribute10
There are roughly 1,000 users, and every user has about 100,000 items on average.
The data in the table is updated every 3 hours from a third-party API (I get a file for each user with the updated items; not all of them have really changed).
The data from this table is used as-is, without aggregations. Each user can see his items on the website.
Today I'm using MySQL and have a few problems with the massive update of records.
I'm thinking of migrating the data to Redshift or one of the NoSQL databases.
I'll be happy to hear your recommendations.
I'd look into Aerospike for this kind of workload. It's what we've been using over here, and we are quite happy with it. It's an open-source NoSQL database that is designed for both in-memory and solid-state disk operation. It can handle a lot of IOPS (100k+ IOPS in-memory, like Redis) if you manage to avoid ultra-hot keys (more than 1000 IOPS on a single 'row'). It can be configured to replicate all data and has synchronous (SSD only) as well as asynchronous (HDD) persistence support.
For your use case, you'd have to decide whether lists can be bounded in size to 128k-1MB or whether you need infinitely growable lists per user. This makes the difference between using a normal list (limited to the record size, 128k-1M) and using a large ordered list (unbounded). Note that you overcome your MySQL limitations the moment you have a single primary key for the list you are trying to query. No joins or anything are required. It only gets a bit fuzzy if list entries need their own primary key (e.g. m:n relations); however, there are concepts that work around that, such as denormalization.
If you give it a few days to figure out what works best, Aerospike can help you with the consistently low latencies that only a product grown up in the ad space can offer. You might not need that right now, but we found that working with SSDs gives us a lot more freedom in what we store, due to the much higher capacity compared to memory.
Other options I'd evaluate would be Redis or Couchbase - if asynchronous persistence is not an issue for you.
You should try an in-memory database with persistence: Redis, Couchbase, Tarantool, Aerospike.
Each of them should handle your workload of heavy updates. This works because these databases don't change the table space on each update, but rather append to the transaction log only, which is the fastest possible way to persist updates.
So if your update workload is less than 100 MB/sec (roughly the linear write speed of a spinning disk), those databases should help you.
But everything depends on your specific workload. You can test all of those databases and choose the best one.
This is a follow-up to a previous question of mine, after definitely deciding on partition switching as the best way to quickly get data into a heavily indexed fact-type table that needs to remain available to readers.
While it seems to be the best way, it is not quite good enough to really satisfy the requirement of allowing several (< 5) users to bulk insert at the same time, have the new data indexed, and have it appear in the indexed views (not necessarily real indexed views, just selects that rely on indexes).
The idea of partitioning was that each partition, and the index subtree rooted at that partition, could in parallel be locked as read-only, copied into a working table, have new data inserted/updated and the indexes rebuilt, and then be switched back into the main table so readers aren't affected.
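For reference, the switching involved is roughly the following (table names and partition number are just for illustration; the working table has to have a matching schema, indexes, and check constraints on the same filegroup):

    -- Move partition 5 of the fact table into an empty working table.
    ALTER TABLE dbo.FactSales SWITCH PARTITION 5 TO dbo.FactSales_Work;

    -- ... bulk insert/update and rebuild indexes on dbo.FactSales_Work ...

    -- Switch the refreshed data back into the main table.
    ALTER TABLE dbo.FactSales_Work SWITCH TO dbo.FactSales PARTITION 5;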
The problem is the single working table. Each parallel bulk insert needs its own copy, with the same constraints as the main table to allow switching.
So far I've hit several walls trying to get around this bottleneck:
I tried partitioning the working table using the same partition function. This doesn't work because you can't disable the indexes on a per-partition basis to insert into one partition while rebuilding the index on another.
Creating a temporary table as the working table. This doesn't work because, while you can use the same index names, you can't easily create the constraints dynamically and can't switch a temp table in anyway.
Have a fixed set of named working tables? How can I select one and work with it under an alias so I have just one stored proc?
Dynamic SQL? I've tried very hard to avoid going that route. It's complicated as it is.
It's a big challenge, but has anyone got any ideas before I accept the bottleneck? Would SQL Server 2012 help? How do proper data warehouses cope with this?
How do proper data warehouses cope with this? Compromise and set realistic goals for the EDW. The data warehouse can't be everything to everyone. Make sure that what you're implementing is the best solution for the business (not just the techies/analysts). Are your goals realistic if you cannot find solutions from experienced peers and experts?
Associate a cost with all of the hoops you jump through. Does the data really need to be up to the minute? What if I told you that we needed to spend another $200,000 on storage because we're constantly duplicating partitions and rebuilding indexes and the current solution can't keep up with the IOPS demand? At some point, they're going to figure out that it's not free. While you don't need to just say no, you do need to be realistic and up-front about the cost associated. Additionally, your storage admin will thank you.
As for 2012, there is a new columnstore index which can reduce or replace all of the current nonclustered indexes you're using to cover your analysts' search requests. It's highly compressed, covers a very wide variety of search arguments, and utilizes the new Batch execution mode. It performs best on low-selectivity queries like the ones frequently performed on fact tables. The one catch is that you can't directly do updates. You'll have to switch the partition out to a staging table, drop the columnstore on the staging table, update the staging table, add the columnstore back, then switch the partition back into the fact table. It sounds like a lot, but it could be significantly faster and require less IO than maintaining all of those nonclustered indexes.
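A rough sketch of that cycle; all object, column, and index names here are invented for illustration:

    -- 1. Switch the partition out to a staging table with a matching schema.
    ALTER TABLE dbo.FactSales SWITCH PARTITION 3 TO dbo.FactSales_Staging;

    -- 2. Drop the columnstore so the staging table becomes updatable again.
    DROP INDEX ncci_Staging ON dbo.FactSales_Staging;

    -- 3. Apply the updates (placeholder update for illustration).
    UPDATE dbo.FactSales_Staging SET Amount = 0 WHERE OrderKey = 42;

    -- 4. Rebuild the columnstore and switch the partition back in.
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Staging
        ON dbo.FactSales_Staging (DateKey, ProductKey, Amount);
    ALTER TABLE dbo.FactSales_Staging SWITCH TO dbo.FactSales PARTITION 3;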
My question has always been "Is it really a fact table if it is constantly changing?". This is not OLTP, is it? Try offsetting transactions, or at least push all updates to a scheduled off-peak time. Updating fact tables is becoming a thing of the past. All of the big boys are moving toward the "update frowned upon" column-oriented architecture for data warehousing. PowerPivot and the Analysis Services Tabular Model are built on the columnstore technology.
Finally, review Kimball's DW Toolkit books. He has several that lay out best practices and cover edge-case scenarios. What I learned from them is that data warehouse development is not just database development on steroids. It also involves politics and focusing resources on what's best for the business.
My company's workflow relies on two MSSQL databases: one for web content data and the other is the ERP. I've been doing a proof of concept on some tools that would serve as an intermediary that builds a relationship between the datasets, and thus far it's proving to be monumentally faster.
Instead of reading out to both datasets, I'd much rather house a database on the local Linux box that represents the data I'm working with. That way, there's less pressure on the system as a whole.
What I don't understand is whether there is a way to update this new database without completely dropping the table each time or running through a punishing line-by-line check. If the records had timestamps, this would be easy... but they don't.
Does anyone have any tips? Am I missing some crucial feature I don't know about, or am I SOL?
Finally, is there one preferred database stack out there anyone thinks might work better than another? I'm not committed to any technology at this point.
Thanks!
Have you read about the MERGE statement in SQL? It allows updates or inserts on existing tables.
I assume your tables have primary keys even though you say there is no timestamp.
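A minimal sketch of what that could look like, assuming an Id primary key and a couple of attribute columns (all table and column names are made up):

    MERGE dbo.LocalCopy AS target
    USING dbo.SourceData AS source
        ON target.Id = source.Id
    -- Only update rows that actually changed (simple comparison, ignores NULL handling).
    WHEN MATCHED AND (target.Col1 <> source.Col1 OR target.Col2 <> source.Col2)
        THEN UPDATE SET target.Col1 = source.Col1, target.Col2 = source.Col2
    WHEN NOT MATCHED BY TARGET
        THEN INSERT (Id, Col1, Col2) VALUES (source.Id, source.Col1, source.Col2);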
Looking for strategies for a very large table with data maintained for reporting and historical purposes; only a very small subset of that data is used in daily operations.
Background:
We have Visitor and Visits tables which are continuously updated by our consumer facing site. These tables contain information on every visit and visitor, including bots and crawlers, direct traffic that does not result in a conversion, etc.
Our back-end site allows management of the visitors (leads) from the front-end site. Most of the management occurs on a small subset of our visitors (visitors that become leads). The vast majority of the data in our visitor and visit tables is maintained only for a much smaller subset of user activity (basically reporting-type functionality). This is NOT an indexing problem; we have done all we can with indexing and keeping our indexes clean, small, and not fragmented.
P.S.: We do not currently have the budget or expertise for a data warehouse.
The problem:
We would like the system to be more responsive to our end users when they are querying, for instance, the list of their assigned leads. Currently the query is against a huge data set of mostly irrelevant data.
I am pondering a few ideas. One involves new tables and a fairly major re-architecture; I'm not asking for help on that. The other involves creating redundant data (for instance, a Visitor_Archive and a Visitor_Small table), where the larger visitor and visit tables exist for inserts and history/reporting, and the smaller Visitor_Small table would exist for managing leads: sending a lead an email, getting a lead's phone number, getting my list of leads, etc.
The reason I am reaching out is that I would love opinions on the best way to keep the Visitor_Archive and the Visitor_Small tables in sync...
Replication? Can I use replication to replicate only data with a certain column value (FooID = x)?
Any other strategies?
It sounds like your table is a perfect candidate for partitioning. Since you didn't mention it, I'll briefly describe it, and give you some links, in case you're not aware of it.
Partitioning lets you divide the rows of a table/index across multiple physical or logical devices, and it is specifically meant to improve the performance of data sets where you may only need a known subset of the data to work with at any time. Partitioning a table still allows you to interact with it as one table (you don't need to reference partitions or anything in your queries), but SQL Server is able to perform several optimizations on queries that only involve one partition of the data. In fact, in Designing Partitions to Manage Subsets of Data, the AdventureWorks examples pretty much match your exact scenario.
I would do a bit of research, starting here and working your way down: Partitioned Tables and Indexes.
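To give a flavour of the setup, here is a minimal sketch; the partitioning column, boundary values, and filegroup are purely illustrative:

    -- Partition function: splits rows into monthly ranges on a date column.
    CREATE PARTITION FUNCTION pfVisitDate (date)
        AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

    -- Partition scheme: maps every partition to the PRIMARY filegroup here.
    CREATE PARTITION SCHEME psVisitDate
        AS PARTITION pfVisitDate ALL TO ([PRIMARY]);

    -- The table is created on the scheme; queries that filter on VisitDate
    -- only touch the relevant partition(s).
    CREATE TABLE dbo.Visits (
        VisitId   bigint NOT NULL,
        VisitorId bigint NOT NULL,
        VisitDate date   NOT NULL
    ) ON psVisitDate (VisitDate);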
Simple solution: create a separate, denormalized table with all the fields in it. Create a stored procedure that will update this table on your schedule. Create a SQL Agent job to call the SP.
Index the table as you see how it's queried.
If you need to purge history, create another table to hold it and another SP to populate it and clean the main report table.
You may end up with multiple report tables - it's OK - space is cheap these days.
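A minimal sketch of that pattern; the table, procedure, and column names are invented for illustration:

    -- Small, denormalized report table.
    CREATE TABLE dbo.LeadReport (
        VisitorId  bigint       NOT NULL PRIMARY KEY,
        Email      varchar(256) NULL,
        Phone      varchar(32)  NULL,
        LastVisit  datetime     NULL
    );
    GO

    -- Stored procedure that rebuilds it from the big source tables;
    -- call it from a SQL Agent job on your schedule.
    CREATE PROCEDURE dbo.RefreshLeadReport
    AS
    BEGIN
        TRUNCATE TABLE dbo.LeadReport;
        INSERT INTO dbo.LeadReport (VisitorId, Email, Phone, LastVisit)
        SELECT v.VisitorId, v.Email, v.Phone, MAX(vi.VisitDate)
        FROM dbo.Visitor AS v
        JOIN dbo.Visits AS vi ON vi.VisitorId = v.VisitorId
        WHERE v.IsLead = 1   -- only the small subset used day to day
        GROUP BY v.VisitorId, v.Email, v.Phone;
    END;
    GO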
I have some data that isn't properly "partitioned" (for lack of a better word).
All inserts, processing, and reporting happen on the same table. The bulk of the processing happens not long after the insert, and not long after that the data becomes immutable (we're talking days).
I could do all inserts and processing on a new table that I replicate to the old table. When I detect that the data has become immutable I would delete the data from the new table, but I would edit the delete replication stored procedure so that the delete did not replicate.
How bad an idea is this? (Edit 1: that is, editing the replication stored procedure.)
It seems attractive at the moment (I haven't slept on it yet) because it might mitigate a performance problem with only very small changes to the application. It also seems like it might be a good way to shoot myself in the foot.
Edit 1:
I like the idea of inserting into two tables because I can avoid the view and the maintenance window described in Jono's answer. No offense, Jono, I actually use this technique elsewhere.
I might want to use replication because one table might be in another database (I know, I didn't mention this) and that way I don't have to worry about committing to two tables, I just let replication handle that.
My actual concern (that I didn't make clear) is that editing the replication stored procedure could end up being a deployment/maintenance headache.
I wouldn't advocate replication to solve a performance issue (unless it's a problem of physical data distribution); if anything it's going to slow your system down as the changes are propagated to their destination. If you're using a single server, I'd suggest adding a second table with the same schema as the first, but with your indexes optimised for the kind of work you do in your processing phase. Then create a view that selects from both tables, and use that view in any query where you want the union of both tables. You could then throw more hardware at the second table (I'm thinking of a separate file group over more spindles) and then migrate the data on a weekly delay into the first table, during an available maintenance window.
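A minimal sketch of the view part of that idea, with invented table and column names:

    -- "Hot" table optimised for the insert/processing phase, "cold" table
    -- holding the immutable history; readers query the view.
    CREATE VIEW dbo.AllEvents
    AS
        SELECT EventId, CreatedAt, Payload FROM dbo.Events_Hot
        UNION ALL
        SELECT EventId, CreatedAt, Payload FROM dbo.Events_Cold;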