This question is about partitioning in Hive/Delta tables.
Which column should we pick to partition a table if the table is always joined on a key whose values are unique?
Ex: we have a table Customer(id, name, otherDetails)
Which field would be suitable for partitioning this table?
Thanks,
Deepak
Good question. Below are factors you need to consider while partitioning -
Requirement - partition when you have a lot of data, the table is heavily used, data is added to it frequently, and you want to manage it better.
Distribution of data - choose a field or fields on which the data is evenly distributed. The most common choice is a date, month, or year, since transactional data is usually spread fairly evenly across those. Country or region can also work if the data is evenly distributed across them.
Loading strategy - you can load, insert, or delete each partition separately, so choose columns that support a good loading strategy. For example, if you delete old data by date on every load, choose the load date as the partition column (see the sketch after this list).
Reasonable number of partitions - make sure you do not end up with thousands of partitions; fewer than 500 is usually fine (check your system's performance).
Do not choose a unique key or composite key as the partition key, because Hive creates a folder of data files for each partition, and thousands of tiny partitions are very difficult to manage.
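To make that concrete, here is a minimal HiveQL-style sketch (Delta syntax differs slightly); the table and column names are assumptions based on the example in the question:

    -- Customer is joined on a unique key, so leave it unpartitioned.
    CREATE TABLE customer (
        id            BIGINT,
        name          STRING,
        other_details STRING
    );

    -- Partition a transactional table by a well-distributed load date instead.
    CREATE TABLE sales (
        customer_id BIGINT,
        amount      DECIMAL(18,2)
    )
    PARTITIONED BY (load_date DATE);

    -- Dropping an old partition removes just that folder of data files:
    ALTER TABLE sales DROP IF EXISTS PARTITION (load_date = '2020-01-01');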
Related
I have an SQL database where, among other things, I have a prices table with one price per product per store.
There are 50 stores and over 500,000 products, so this table will easily have 25 to 30 million records.
The table is fed nightly with price updates and has heavy read traffic during the day. Reads are made with read-only intent.
All queries contain storeid as part of identifying the record to update or read.
I'm not yet able to determine how this will behave, since I'm still expecting the external supply of prices, but I anticipate performance issues at least on read operations, even though indexes are in place for now.
My question is whether I should consider partitioning the table by store, since it is always part of the queries. But then I have indexes where storeid is not the only column in the index.
Based on this scenario, would you recommend partitioning? The alternative I see is having 50 tables, one per store, but that seems painful and better avoided if possible.
"Should I consider partitioning the table by store, since it is always part of the queries?"
Yes. That sounds promising.
"But then I have indexes where storeid is not the only column that is part of the index."
That's fine. As long as the partitioning column is one of the clustered index columns, you can partition by it. In fact, with partitioning you can get partition elimination even on a trailing column of the clustered index, and then a clustered index seek within the target partition.
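To illustrate, here is a hedged T-SQL sketch; the table, column, and object names are assumptions, and a real partition function needs one boundary per store:

    CREATE PARTITION FUNCTION pf_store (int)
        AS RANGE LEFT FOR VALUES (1, 2, 3 /* ...one boundary per store id... */);

    CREATE PARTITION SCHEME ps_store
        AS PARTITION pf_store ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Prices (
        ProductId int            NOT NULL,
        StoreId   int            NOT NULL,
        Price     decimal(18, 4) NOT NULL,
        CONSTRAINT PK_Prices PRIMARY KEY CLUSTERED (ProductId, StoreId)
    ) ON ps_store (StoreId);  -- partitioning column is a trailing clustered key column

    -- A query such as
    --   SELECT Price FROM dbo.Prices WHERE StoreId = 7 AND ProductId = 12345;
    -- gets partition elimination on StoreId and then a clustered index seek
    -- on ProductId inside that single partition.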
Hi all and thank you for your replies.
I was able to gather significant information in a contained environment, where I confirmed that I can achieve excellent performance indicators using only the appropriate indexes.
So for now we will keep it "as is" and have the partition strategy on hand just in case.
Thanks again, nice tips guys
Designing an Oracle database for an ordering system. Each row will be a schedule that can be assigned to stores, designating if/when they will order from a specific vendor for each day of the week.
It will be keyed by vendor id and a unique schedule id. Started out with those columns, and then a column for each day of the week like TIME_SUN, TIME_MON, TIME_TUE... to contain the order time for each day.
I'm normally inclined to try and normalize data and have another table referencing the schedule id, with a column like DAY_OF_WEEK and ORDER_TIME, so potentially 7 rows for the same data.
Is it really necessary for me to do this, or is it just overcomplicating what can be handled as a simple single row?
Normalization is the best way. Reasons:
The table will act as a master table
The table can be used for reference in future needs
It will be costly to normalize later
If a huge number of rows repeat the same column values, the database grows unnecessarily
Using a master table limits the redundancy to the foreign key
Normalization would be advisable. If in the future you need to store two or more order times for the same day, you only have to add rows to your vendor_day_order table. If you go with the first approach, you would have to modify the table structure.
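A rough Oracle sketch of the normalized design (the table and column names here are illustrative, not taken from the actual schema):

    CREATE TABLE vendor_schedule (
        schedule_id NUMBER PRIMARY KEY,
        vendor_id   NUMBER NOT NULL
    );

    CREATE TABLE vendor_day_order (
        schedule_id NUMBER      NOT NULL REFERENCES vendor_schedule (schedule_id),
        day_of_week VARCHAR2(3) NOT NULL,   -- 'SUN', 'MON', ...
        order_time  DATE        NOT NULL,   -- time-of-day portion used
        CONSTRAINT pk_vendor_day_order
            PRIMARY KEY (schedule_id, day_of_week, order_time)
    );

    -- A second order time on the same day is simply another row;
    -- the wide TIME_SUN/TIME_MON layout would need a new column instead.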
Here is the problem: I have a sales table with columns like (primary key ID, product name, product ID, store name, store ID, sales date). I want to do analysis such as drill-up and drill-down on store/product/sales date.
There are two design options I am thinking about,
Create individual indexes on columns like product name, product ID, store name, store ID, and sales date;
Use a data warehouse snowflake model, treating the current sales table as the fact table, and create product, store, and sales date dimension tables.
To get better analysis performance, I have heard the snowflake model is better. But why is it better than indexes on the related columns, from a database design perspective?
thanks in advance,
Lin
Knowing your app usage patterns and what you want to optimize for are important. Here are a few reasons (among many) to choose one over the other.
Normalized Snowflake PROs:
Faster queries and lower disk and memory requirements. Because each normalized row carries only short keys rather than longer text fields, your primary fact table becomes much smaller. Even when an index is used (unless the query can be answered by the index alone), partial table scans are often required, and smaller data means fewer disk reads and faster access.
Easier modifications and better data integrity. Say a store changes its name. In the snowflake model you change one row, whereas in a large denormalized table you have to change it everywhere it appears, and you often end up with spelling errors and multiple variations of the same name.
Denormalized Wide Table PROs:
Faster single record loads. When you most often load just a single record or small number of records, having all your data together in one row will incur only a single cache miss or disk read, whereas in the snowflake the DB might have to read from multiple tables in different disk locations. This is more like how NoSQL databases store their "objects" associated with a key.
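As a rough sketch of the two shapes being compared (column names are assumptions based on the question):

    -- Snowflake/star style: narrow fact table plus small dimension tables.
    CREATE TABLE dim_product (
        product_id   INT PRIMARY KEY,
        product_name VARCHAR(200)
    );

    CREATE TABLE dim_store (
        store_id   INT PRIMARY KEY,
        store_name VARCHAR(200)
    );

    CREATE TABLE fact_sales (
        sales_id   INT  PRIMARY KEY,
        product_id INT  NOT NULL REFERENCES dim_product (product_id),
        store_id   INT  NOT NULL REFERENCES dim_store (store_id),
        sales_date DATE NOT NULL              -- or a key into a date dimension
    );

    -- Wide/denormalized style: everything in one row, so a single-record
    -- lookup needs only one read, but names are repeated on every sale.
    CREATE TABLE sales_wide (
        sales_id     INT PRIMARY KEY,
        product_id   INT,
        product_name VARCHAR(200),
        store_id     INT,
        store_name   VARCHAR(200),
        sales_date   DATE
    );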
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size. I've researched and evaluated the solutions I've come across, but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect usage statistics (the queries actually run), look at the engine's own statistics such as sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning scheme that best balances data size and gives the best affinity for the most frequently run queries will be a good candidate. Of course you'll have to compromise.
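For the measurement step, a query along these lines (a sketch; the filter and ordering are just one reasonable choice) shows how each index on the two tables is actually being used:

    -- Usage counters for indexes on the two tables (reset on instance restart).
    SELECT  OBJECT_NAME(s.object_id) AS table_name,
            i.name                   AS index_name,
            s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
    FROM    sys.dm_db_index_usage_stats AS s
    JOIN    sys.indexes AS i
            ON  i.object_id = s.object_id
            AND i.index_id  = s.index_id
    WHERE   s.database_id = DB_ID()
      AND   OBJECT_NAME(s.object_id) IN ('Case', 'Case_Participant')
    ORDER BY s.user_seeks + s.user_scans + s.user_lookups DESC;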
Also don't forget that partitioning is per index, not per table (in this sense the 'table' is just one of its indexes), so the question is not what to partition on, but which indexes to partition and what partitioning function to use. Your clustered indexes on the two tables are obviously the most likely candidates (there is not much sense in partitioning just a non-clustered index while leaving the clustered one unpartitioned), so unless you're considering a redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I had to venture a guess, I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partitioning scheme is a sliding window.
If you have no other choice, you can partition by the key modulo the number of partition tables.
Let's say you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
Then partition your data by UniqueIdentifier or PrimaryKey modulo 10 and place each record in the corresponding table (depending on how your UniqueIdentifier is generated, you might need to allocate ids manually).
When querying, you will need to run the same query against all the tables and use UNION to merge the result sets into a single result, as sketched below.
It's not as good as partitioning the tables based on some logical separation that corresponds to the expected queries, but it's better than hitting the size limit of a table.
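A sketch of the read side, using the Case tables above (the Status filter is just an example; UNION ALL is enough because the modulo split guarantees the tables never share a key):

    SELECT * FROM Case00 WHERE Status = 'Open'
    UNION ALL
    SELECT * FROM Case01 WHERE Status = 'Open'
    -- ... repeat for Case02 through Case08 ...
    UNION ALL
    SELECT * FROM Case09 WHERE Status = 'Open';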
Another possible thing to look at (before partitioning) is your model.
Is the database normalized? Are there further steps that could improve performance through different normalization, denormalization, or partial-normalization choices? Is there an option to transform the data into a Kimball-style dimensional star model, which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.
Microsoft, in its MSDN entry about altering SQL Server 2005 partitions, lists a few possible approaches:
Create a new partitioned table with the desired partition function, and then insert the data from the old table into the new table by using an INSERT INTO...SELECT FROM statement.
Create a partitioned clustered index on a heap
Drop and rebuild an existing partitioned index by using the Transact-SQL CREATE INDEX statement with the DROP EXISTING = ON clause.
Perform a sequence of ALTER PARTITION FUNCTION statements.
Any idea which of these will be the most efficient for a large-scale DB (millions of records) with partitions based on record dates (something like monthly partitions), where the data spreads over 1-2 years?
Also, since I mostly access recent information (for reading), does it make sense to keep one partition for the last X days and put all the rest of the data in another partition? Or is it better to partition the rest of the data too (for random access based on a date range)?
I'd recommend the first approach - creating a new partitioned table and inserting into it - because it gives you the luxury of comparing your old and new tables. You can test query plans against both styles of tables and see if your queries are indeed faster before cutting over to the new table design. You may find there's no improvement, or you may want to try several different partitioning functions/schemes before settling on your final result. You may want to partition on something other than date range - date isn't always effective.
I've done partitioning with 300-500m row tables with data spread over 6-7 years, and that table-insert approach was the one I found most useful.
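For reference, the first approach looks roughly like this in T-SQL (a hedged sketch: the object and column names are placeholders, and a real function needs one boundary per month):

    -- Monthly partition function/scheme (datetime, since this is SQL Server 2005).
    CREATE PARTITION FUNCTION pf_monthly (datetime)
        AS RANGE RIGHT FOR VALUES ('20070101', '20070201', '20070301'
                                   /* ...one boundary per month... */);

    CREATE PARTITION SCHEME ps_monthly
        AS PARTITION pf_monthly ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Records_New (
        RecordId   bigint       NOT NULL,
        RecordDate datetime     NOT NULL,
        Payload    varchar(400) NULL,
        CONSTRAINT PK_Records_New PRIMARY KEY CLUSTERED (RecordDate, RecordId)
    ) ON ps_monthly (RecordDate);

    -- Copy the data, compare query plans against the old table, then cut over.
    INSERT INTO dbo.Records_New (RecordId, RecordDate, Payload)
    SELECT RecordId, RecordDate, Payload
    FROM   dbo.Records;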
You asked how to partition: the best answer is to design your partitions so that your queries hit a single partition. If you tend to concentrate queries on recent data, and you filter on that date field in your WHERE clauses, then yes, have a separate partition for the most recent X days.
Be aware that you do have to specify the partitioned field in your WHERE clause. If you aren't specifying that field, the query will probably hit every partition to get the data, and at that point you won't see any performance gains.
Hope that helps! I've done a lot of partitioning, and if you post a few examples of table structures and queries, that will help you get a better answer for your environment.