Does defining a clustering key impact bulk copy performance? For example, if I have a large table that I am loading data into and it has a two-column clustering key, would removing the clustering key help load the file faster? Or, since clustering is an automated process that happens in the background, does Snowflake load the data first and then cluster it in the background while the data is already available to use?
No, it does not affect bulk copy performance.
And yes, your understanding is exactly how it works: clustering is an automated background process. Snowflake loads the data first and then reclusters it in the background, while the data is already available to use.
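To make that concrete, here is a minimal sketch in Snowflake SQL (table, column, and stage names are hypothetical). The point is that the COPY and the reclustering are separate steps: you can even suspend and resume Automatic Clustering around a large load to see for yourself that the load is not blocked by it:

    -- define or remove a clustering key on an existing table
    ALTER TABLE my_fact CLUSTER BY (region, load_date);
    -- ALTER TABLE my_fact DROP CLUSTERING KEY;

    -- optionally pause the background Automatic Clustering service around a big load
    ALTER TABLE my_fact SUSPEND RECLUSTER;

    -- the bulk load itself; reclustering does not run as part of this statement
    COPY INTO my_fact
      FROM @my_stage/daily/
      FILE_FORMAT = (TYPE = CSV);

    -- let background reclustering catch up, and check how well clustered the table is
    ALTER TABLE my_fact RESUME RECLUSTER;
    SELECT SYSTEM$CLUSTERING_INFORMATION('my_fact');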
I have my data warehouse built on Amazon Redshift. The problem I am currently facing is that I have a huge fact table (about 500M rows) in my schema with data for about 10 clients. I have processes that periodically (mostly daily) generate data for this fact table and require a refresh, meaning: delete the old data and insert the newly generated data.
The problem is that this bulk delete-insert operation leaves holes in the fact table, creating a need to VACUUM, which is time consuming and hence can't be done immediately. And this fact table (with huge holes due to deleted data) dramatically slows the snapshot process that consumes data from the fact and dimension tables and refreshes the downstream presentation area. How can I optimize such a bulk data refresh in a DWH environment?
I believe this should be a well-known problem in data warehousing with some recommended ways to solve it. Can anyone please point out the recommended solution?
P.S.: One solution could be to create a table per client and put a view on top that unions all the underlying tables. In this case, each per-client fact table is pretty small and can be vacuumed quickly after the delete-insert, but I'm looking for solutions with better maintainability.
You might try playing with the different types of VACUUM. There is VACUUM DELETE ONLY, which will reclaim the space but won't re-sort rows; I'm not sure if it's applicable to your use case.
More info here: http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html
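For reference, it is a one-liner (schema and table names hypothetical):

    -- reclaim space from deleted rows without re-sorting the table
    VACUUM DELETE ONLY my_schema.fact_table;

Because it skips the sort phase it typically finishes much faster than a full VACUUM, but it will not repair the unsorted regions left by your inserts.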
Alternatively, I used the deep copy approach when I was fighting with vacuuming tables that had too many columns. The problem with this could be that you will need a lot of space for the intermediate step.
More info here:
http://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
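A minimal sketch of the simplest deep-copy variant from that page (table names hypothetical); note that you need enough free space to hold a second copy of the table while both exist:

    -- recreate the table definition, reload it (the bulk insert writes it sorted), then swap
    CREATE TABLE fact_table_copy (LIKE fact_table);
    INSERT INTO fact_table_copy SELECT * FROM fact_table;
    DROP TABLE fact_table;
    ALTER TABLE fact_table_copy RENAME TO fact_table;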
I need help selecting the right database for my data.
I have a table of usersItems with the following columns:
userId, itemId, attribute1, attribute2, attribute3, ..., attribute10
There are about 1,000 users, and every user has about 100,000 items on average.
The data in the table is updated every 3 hours from a third-party API. (I get a file for each user with the updated items; not all of them have really changed.)
The data from this table is used as is, without aggregations. Each user can see his items on the website.
Today I'm using MySQL and have a few problems with the massive update of records.
I'm thinking of migrating the data to Redshift or one of the NoSQL DBs.
I'll be happy to hear your recommendations.
I'd look into Aerospike for this kind of workload. This is what we've been using over here, and we are quite happy with it. It's an open-source NoSQL database that is designed for both in-memory and solid-state-disk operation. It can handle a lot of IOPS (100k+ IOPS in memory, like Redis), if you manage to avoid ultra-hot keys (more than 1,000 IOPS on a single 'row'). It can be configured to replicate all data and has synchronous (SSD only) as well as asynchronous (HDD) persistence support.
For your use case, you'd have to decide whether lists can be bounded in size to 128k - 1MB or whether you need infinitely growable lists per user. This makes the difference between using a normal list (limited to the record size, 128k-1M) and using a large ordered list (unbounded). Note that you overcome your MySQL limitations the moment you have a single primary key for the list you are trying to query. No joins or anything are required. It only gets a bit fuzzy if list entries need their own primary key (e.g. m:n relations); however, there are concepts that work around that, like denormalization.
If you give it a few days of figuring out what works best, Aerospike can help you with consistently low latencies that only a product grown up in the ad space can offer. You might not need that right now, but we found that working with SSDs gives us a lot more freedom in terms of what we store, due to the much higher capacity compared to memory.
Other options I'd evaluate would be Redis or Couchbase, if asynchronous persistence is not an issue for you.
You should try an in-memory database with persistence: Redis, CouchBase, Tarantool, Aerospike.
Each of them should handle your workload of heavy updates. This works because these databases don't change the table space on each update, but rather append to the transaction log only, which is the fastest possible way to persist updates.
So if your update workload is less than 100 MB/sec (roughly the sequential write speed of a spinning disk), then these databases should help you.
Everything depends on your specific workload, though. You can test all of these databases and choose the best one.
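As a rough sanity check against the numbers in the question: roughly 1,000 users x 100,000 items is about 100M rows per 3-hour refresh window (about 10,800 seconds), i.e. on the order of 10,000 row updates per second at most. Assuming something like 100 bytes per row (an assumption, since the attribute sizes aren't given), that is roughly 10 GB per window, or about 1 MB/sec of sustained log writes, which is far below the 100 MB/sec figure above, and it is an upper bound since not all items actually change.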
This is a follow up to a previous question of mine after definitely deciding on partition switching as the best way to quickly get data into a heavily indexed fact type table that needs to remain available to readers.
While it seems to be the best way, it is not quite good enough to really satisfy the requirement to allow several (< 5) users to bulk insert at the same time, have the new data indexed, and have it appear in the indexed views (not necessarily real indexed views, just selects that rely on indices).
The idea of partitioning was that each partition, and the index subtree rooted at that partition, could in parallel be locked as read-only, copied into a working table, have new data inserted/updated and the indexes rebuilt, and then be switched back into the main table so readers aren't affected.
The problem is the single working table. Each parallel bulk insert needs its own copy, with the same constraints as the main table to allow switching.
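For anyone following along, a minimal sketch of that round trip in T-SQL, assuming a partitioned fact table dbo.Fact and a single non-partitioned working table dbo.FactWork with identical columns, indexes, and a check constraint matching partition 3's boundaries (all names hypothetical):

    -- move the target partition out to the working table (metadata-only when requirements are met)
    ALTER TABLE dbo.Fact SWITCH PARTITION 3 TO dbo.FactWork;

    -- bulk insert/update the working table, then rebuild its indexes
    INSERT INTO dbo.FactWork (KeyCol, LoadDate, Measure)
    SELECT KeyCol, LoadDate, Measure FROM dbo.NewRowsStage;
    ALTER INDEX ALL ON dbo.FactWork REBUILD;

    -- switch the rebuilt data back into the (now empty) partition
    ALTER TABLE dbo.FactWork SWITCH TO dbo.Fact PARTITION 3;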
So far I've hit several walls trying to get around this bottleneck:
I tried partitioning the working table using the same partition function. This doesn't work because you can't disable the indexes on a partition basis to insert into one while rebuilding the index on another.
Creating a temporary table as the working table. This doesn't work because, while you can use the same index names, you can't easily dynamically create the constraints and can't switch that in anyway.
Have a fixed set of named working tables? How can I select one and work with it under an alias so I have just one stored proc?
Dynamic SQL? I've tried very hard to avoid going that route. It's complicated as it is.
Big challenge, but has anyone got any ideas before I accept the bottleneck? Would SQL Server 2012 help? How do proper data warehouses cope with this?
How do proper data warehouses cope with this? Compromise and set realistic goals for the EDW. The data warehouse can't be everything to everyone. Make sure that what you're implementing is the best solution for the business (not just the techies/analysts). Are your goals realistic if you cannot find solutions from experienced peers and experts?
Associate a cost with all of the hoops you jump through. Does the data really need to be up to the minute? What if I told you that we needed to spend another $200,000 on storage because we're constantly duplicating partitions and rebuilding indexes and the current solution can't keep up with the IOPS demand? At some point, they're going to figure out that it's not free. While you don't need to just say no, you do need to be realistic and up-front about the cost associated. Additionally, your storage admin will thank you.
As for 2012, there is a new columnstore index which can reduce or replace all of the current nonclustered indexes you're using to cover all your analysts' search requests. It's highly compressed, covers a very wide variety of search arguments, and utilizes the new Batch execution mode. It performs best on low-selectivity queries like the ones frequently performed on fact tables. The one catch is that you can't directly do updates. You'll have to switch the partition out to a staging table, drop the columnstore on the staging table, update the staging table, add the columnstore back, then switch the partition back into the fact table. It sounds like a lot, but it could be significantly faster and require less IO than maintaining all of those nonclustered indexes.
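Sketching that partition-out/partition-in cycle (hypothetical object names; this is the SQL Server 2012 pattern, where the nonclustered columnstore is read-only, so the staging table's copy of the index has to be dropped and recreated):

    -- move the partition to a staging table that mirrors the fact table's structure
    ALTER TABLE dbo.Fact SWITCH PARTITION 3 TO dbo.FactStage;

    -- drop the staging table's columnstore so it becomes updatable, apply the changes
    DROP INDEX NCCI_FactStage ON dbo.FactStage;
    UPDATE dbo.FactStage SET Amount = Amount * 1.05 WHERE LoadDate = '20120101';

    -- put the columnstore back and switch the partition into the fact table again
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactStage
        ON dbo.FactStage (KeyCol, LoadDate, Amount);
    ALTER TABLE dbo.FactStage SWITCH TO dbo.Fact PARTITION 3;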
My question has always been, "Is it really a fact table if it is constantly changing?" This is not OLTP, is it? Try offsetting transactions, or at least push all updates to a scheduled off-peak time. Updating fact tables is becoming a thing of the past. All of the big boys are moving toward the "update frowned upon" column-oriented architecture for data warehousing. PowerPivot and the Analysis Services Tabular Model are built on the columnstore technology.
Finally, review Kimball's DW Toolkit books. He has several that lay out best practices and cover edge-case scenarios. What I learned from them is that data warehouse development is not just database development on steroids. It also involves politics and focusing resources on what's best for the business.
I am trying to implement a surrogate key generator using PIG.
I need to persist the last generated key in a Database and query the Database for the next available key.
Is there any support in PIG to query the Database using ODBC?
If yes, please provide guidance or some samples.
Sorry for not answering your question directly, but this is not something you want to be doing. For a few reasons:
Your MapReduce job is going to hammer your database as a single performance chokepoint (you are basically defeating the purpose of Hadoop).
With speculative execution, you'll have the same data get loaded up twice so some unique identifiers won't exist when one of the tasks gets killed.
I think if you can conceivably hit the database once per record, you can just do this surrogate key enrichment without MapReduce in a single thread.
Either way, building surrogate keys or automatic counters is not easy in Hadoop because of the shared-nothing nature of the thing.
I'm implementing a web-based application using Silverlight, with a SQL Server DB on the back end for all the data that the application will display. I want to ensure that the application can be easily scaled, and I feel the direction to go in is to make the database loosely coupled and not tie everything up with foreign keys. I've tried searching for some examples but to no avail.
Does anyone have any information or good starting points/samples/examples to help me get off the ground with this?
Help greatly appreciated.
Kind regards,
I think you're mixing up your terminology a bit. "Loosely coupled" refers to the desirability of having software components that aren't so dependent upon each other that they can't function or even compile without being together in the same program. I've never seen the term used to describe the relationships between tables in the same database.
I think if you search on the terms "normalization" and "denormalization" you'll get better results.
Unless you're doing massive amounts of inserts at a time, like with a data warehouse, use foreign keys. Normalization scales like crazy, and you should take advantage of that. Foreign keys are fast, and the constraint really only holds you back if you're inserting millions upon millions of records at a time.
Make sure that you're using integer keys that have a clustered index on them. This should make joining tables very rapid. The issues you can get yourself wrapped up in without foreign keys are many and frustrating. I just spent all weekend dealing with them, and we made a conscious choice to not have foreign keys (we have terabytes of data, though).
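As a purely illustrative sketch of that advice (hypothetical tables): integer surrogate keys with clustered primary keys, and a foreign key to protect the relationship:

    CREATE TABLE dbo.Customer (
        CustomerId INT IDENTITY(1,1) NOT NULL,
        Name       NVARCHAR(200)     NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId)
    );

    CREATE TABLE dbo.OrderHeader (
        OrderId    INT IDENTITY(1,1) NOT NULL,
        CustomerId INT               NOT NULL,
        OrderDate  DATETIME2         NOT NULL,
        CONSTRAINT PK_OrderHeader PRIMARY KEY CLUSTERED (OrderId),
        CONSTRAINT FK_OrderHeader_Customer
            FOREIGN KEY (CustomerId) REFERENCES dbo.Customer (CustomerId)
    );

Joins then seek on narrow integer clustered indexes, and the foreign key guarantees that no order can point at a customer that doesn't exist.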
Before you even think of such a thing, you need to think about data integrity. Foreign keys exist so that you cannot put records into tables if the primary data they are based on is not there. If you do not use foreign keys, you will sooner or later (probably sooner) end up with worthless data, because you don't really know, for instance, which customer an order is attached to. Foreign keys are data protection; you should never consider not using them.
And even though you think all your data will come from your application, in real life this is simply not true. Data gets in from multiple applications, from imports of large amounts of data, and from the query window (think about when someone decides to update all the prices: they aren't going to do that one price at a time from the user interface). Data can get into the database from many sources and must be protected at the database level. To do less is to put your entire application and data at risk.
Interesting comment about database security when data is input through external sources like database scripts.