I have a scenario in which there's a huge amount of status data about an item.
The item's status is updated every minute, and there will be about 50,000 items in the near future, so one month will produce about 2,232,000,000 rows of data. I must keep at least 3 months in the main table before archiving older data.
I need queries to be quick when they filter on a specific item (its ID) and a date range (usually up to one month), e.g. select A, B, C from Table where ItemID = 3000 and Date between '2010-10-01' and '2010-10-31 23:59:59.999'
So my question is how to design a partitioning structure to achieve that?
Currently, I'm partitioning on the "item's unique identifier" (an int) mod "the number of partitions", so that rows are distributed equally across all partitions. The drawback is that the table needs one additional column to act as the partitioning column for the partition function, mapping each row to its partition. All of that adds a little extra storage. Also, each partition is mapped to a different filegroup.
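The row estimate and the modulo mapping described above can be checked with a few lines of Python. This is only a sketch: the 31-day month is an assumption that happens to reproduce the quoted figure, and the partition count of 8 and the helper name `partition_for` are hypothetical, since the question does not state them.

```python
# Rough check of the monthly row estimate and of modulo-based partition assignment.
ITEMS = 50_000
MINUTES_PER_DAY = 24 * 60
DAYS_IN_MONTH = 31  # assumption: a 31-day month, which matches the quoted figure

rows_per_month = ITEMS * MINUTES_PER_DAY * DAYS_IN_MONTH
print(f"{rows_per_month:,}")  # 2,232,000,000

NUM_PARTITIONS = 8  # hypothetical; the question does not give the partition count


def partition_for(item_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map an item ID to a 1-based partition number using modulo."""
    return item_id % num_partitions + 1


# Count how many of the 50,000 item IDs land in each partition.
counts = [0] * NUM_PARTITIONS
for item_id in range(1, ITEMS + 1):
    counts[partition_for(item_id) - 1] += 1
print(counts)
```

With sequential IDs the items spread evenly across the partitions, which is the property the question relies on.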

Partitioning is never done for query performance. With partitioning, performance will always be worse: the best you can hope for is no big regression, never an improvement.
For query performance, anything a partition can do, an index can do better, and that should be your answer: index appropriately.
Partitioning is useful for IO path control cases (distribute on archive/current volumes) or for fast switch-in switch-out scenarios in ETL loads. So I would understand if you had a sliding window and partition by date so you can quickly switch out the data that is no longer needed to be retained.
Another narrow case for partitioning is last-page insert latch contention, as described in Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads.
Your partition scheme and use case do not seem to fit any of the scenarios in which partitioning would help (maybe the last scenario, but that is not clear from the description), so most likely it hurts performance.
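To illustrate why an index answers the "item plus date range" query well, here is a toy Python model. It assumes an index keyed on (ItemID, Date), which the answer does not spell out, and it stands in for that index with a sorted list; `range_seek` is a hypothetical helper, not a SQL Server API.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# Toy stand-in for an index keyed on (ItemID, Date): a sorted list of tuples.
timestamps = [
    datetime(2010, 9, 30, 8, 0),
    datetime(2010, 10, 1, 0, 0),
    datetime(2010, 10, 31, 23, 59),
    datetime(2010, 11, 1, 0, 0),
]
index = sorted((item_id, ts) for item_id in (2999, 3000, 3001) for ts in timestamps)


def range_seek(index, item_id, start, end):
    """Return the contiguous slice of entries for one item within [start, end]."""
    lo = bisect_left(index, (item_id, start))   # first entry >= (item, start)
    hi = bisect_right(index, (item_id, end))    # one past the last entry <= (item, end)
    return index[lo:hi]


rows = range_seek(
    index,
    3000,
    datetime(2010, 10, 1),
    datetime(2010, 10, 31, 23, 59, 59, 999000),
)
print(rows)
```

Two binary searches locate the start and end of the range, and everything in between is read sequentially, regardless of how many other items the structure holds.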

I do not really agree with Remus Rusanu. I think partitioning may improve performance if there's a logical reason for it (related to your use cases). My guess is that you could partition ONLY on the itemID. The alternative would be to use the date as well, but if you cannot guarantee that a date range will stay within the boundaries of a given partition (queries are not sure to fall within a single month), then I would stick to itemId partitioning.
If there are only a few fields you need to compute on, another option is a covering index: define an INDEX on your main differentiation field (the itemId) which INCLUDEs the fields you need to compute.
-- MyTable is a placeholder for your table name
CREATE INDEX idxTest ON MyTable (itemId) INCLUDE (quantity);

Applicative partitioning actually CAN be beneficial for query performance. In your case you have 50K items and 2G rows. You could for example create 500 tables, each named status_nnn where nnn is between 001 and 500 and "partition" your item statuses equally among these tables, where nnn is a function of the item id. This way, given an item id, you can limit your search a priori to 0.2% of the whole data (ca. 4M rows).
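A minimal sketch of the routing function, in Python. The modulo mapping and the name `status_table_for` are assumptions for illustration; the answer only says that nnn is "a function of the item id".

```python
NUM_TABLES = 500


def status_table_for(item_id: int) -> str:
    """Return the name of the table assumed to hold statuses for item_id."""
    # item_id % 500 gives 0..499; shift to 1..500 and zero-pad to three digits.
    return f"status_{item_id % NUM_TABLES + 1:03d}"


print(status_table_for(3000))  # status_001

# Every one of the 50,000 item IDs maps to one of exactly 500 tables.
tables = {status_table_for(item_id) for item_id in range(1, 50_001)}
print(len(tables))  # 500
```

The application has to call such a function before building every query, which is where the dynamic SQL comes from.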
This approach has a lot of disadvantages, as you'll probably have to deal with dynamic SQL and other unpleasant issues, especially if you need to aggregate data from different tables. BUT, it will definitely improve performance for certain queries, such as the ones you mention.
Essentially, applicative partitioning is similar to creating a very wide and flat index, optimized for very specific queries, without duplicating the data.
Another benefit of applicative partitioning is that you could in theory (depending on your use case) distribute your data among different databases and even different servers. Again, this depends very much on your specific requirements, but I've seen and worked with huge data sets (billions of rows) where applicative partitioning worked very well.


Large partition count performance impact

I have tables that have millions of partitions.
Should I reduce the partition count for performance?
In my experience with Spark applications and the Hive query system, too many partitions were bad for performance.
If you do not have auto clustering on the table, it will not be automatically defragmented. So if you write to the table frequently with small row counts, it will be in very bad shape.
Partition count impacts compile time badly, as every partition has metadata that is loaded to plan/optimize the query. I would suggest doing a rebuild test (select into a new transient table) and running some comparable queries to see the difference in compile time.
We have a number of tables for which sorting (and thus auto clustering) does not make sense, because the usage pattern is always a full-table scan. We just rebuild those tables on a schedule to keep the partition count down, and for us that rebuild cost is worth the performance gain.
As with everything Snowflake you should run a test, and see how it is for you. And monitor hot spots as they can and do change.
In Snowflake, there are micro-partitions, and they are managed automatically. Therefore you do not need to worry about the number of micro-partitions.
It says:
Micro-partitioning is automatically performed on all Snowflake tables. Tables are transparently partitioned using the ordering of the data as it is inserted/loaded.
From this page, I understand that micro-partitions are managed by Snowflake, and you do not need to focus on reducing the partition count (this is the original question).
This should also help to understand the difference between clustering and micro-partitions:
If you read the above link, you can see that it is not a must to define clustering on even large tables to get a good query performance!
As for the original question about reducing the partition count, I also have to say that clustering does not always reduce the number of partitions, but that is another story.

Performance of Column Family in Cassandra DB

I have a table that will be queried purely by id and created_time. It has 50 other columns, all of which will be read using only those two keys. I can design it in two ways:
Multiple small tables with 5 columns each, covering all 50 parameters
A single table with all 50 columns, with id and created_at as the primary key
Which will be better? My row count will grow tremendously, so should I worry about the length of the column family while modelling?
Actually, you should prefer small tables to reduce the load on any single table, and you should also try to keep each table query-based. If your query reads all 50 columns, you can go with a single table. But if each query fetches only part of the data, maintain small query-based tables, which will spread the data more evenly across the nodes, or maintain multiple partitions as Alex suggested (but then you cannot run range-based queries).
This really depends on how you structure your partition key and on the distribution of data inside each partition. CQL has some limits, such as a maximum of 2 billion cells per partition, but that is a theoretical limit; the practical limits are tighter, for example not having partitions bigger than 100 MB (DSE has recommendations in its planning guide).
If you'll always search by id and created_time, and won't do range queries on created_time, then you can even use a composite partition key comprising both, which will distribute data more evenly across the cluster. Otherwise, make sure that you don't have too much data inside partitions.
Or you can add another component to the partition key. For example, people sometimes add a truncated date-time, such as the time rounded to the hour or to the day, but this will affect your queries. It really depends on them.
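The truncated date-time idea can be sketched in Python as follows. The function name `bucketed_partition_key` and the `bucket` parameter are hypothetical; in a real data model the bucket would be a column in the table's partition key, computed by the application on both writes and reads.

```python
from datetime import datetime


def bucketed_partition_key(row_id: str, created_time: datetime, bucket: str = "day"):
    """Build an (id, time bucket) partition key by truncating created_time."""
    if bucket == "hour":
        truncated = created_time.replace(minute=0, second=0, microsecond=0)
    elif bucket == "day":
        truncated = created_time.replace(hour=0, minute=0, second=0, microsecond=0)
    else:
        raise ValueError(f"unsupported bucket: {bucket!r}")
    return (row_id, truncated)


morning = bucketed_partition_key("sensor-1", datetime(2018, 3, 5, 9, 15))
evening = bucketed_partition_key("sensor-1", datetime(2018, 3, 5, 21, 40))
next_day = bucketed_partition_key("sensor-1", datetime(2018, 3, 6, 0, 5))

print(morning == evening)   # True: same id and same day share a partition
print(morning == next_day)  # False: a new day starts a new partition
```

This caps how large any one partition can grow, but a query that spans several days has to enumerate each day's bucket, which is the effect on queries mentioned above.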
Sort of in line with what Alex mentions, the determining factor here is going to be the size of your various partitions (which is an extension of the size of your columns).
Practically speaking, you can have problems going both ways - partitions that are too narrow can be as problematic as partitions that are too wide, so this is the type of thing you may want to try benchmarking and seeing which works best. I suspect for normal data models (staying away from the pathological edge cases), either will work just fine, and you won't see a meaningful difference (assuming 3.11).
In 3.11.x, Cassandra does a better job of skipping unrequested values than in 3.0.x, so if you do choose to join it all in one table, do consider using 3.11.2 or whatever the latest available release is in the 3.11 (or newer) branch.

WHERE clause vs Smaller table

Is there a meaningful difference (or a rule of thumb for a given table size) for query time of a table with a WHERE clause limiting the result set compared to a smaller table which is equal to the size of the post-WHERE, limited result set?
For example:
1. Your table has records with timestamps spanning many years. You run a query that contains a WHERE clause limiting your result to the last 10 days only.
2. Your table has only 10 days of data, and you run the same query as above (obviously without the WHERE clause since it's not necessary in this case).
Should I expect a query performance difference in the two scenarios above? Note that I'm using Redshift. Obviously there is a $$ cost savings of storing less data, which is one benefit of scenario 2. Any others?
It depends entirely on the table and the indexes (in the case of Redshift, the sort key). Traditionally, if you have a descending index on the timestamp and use the timestamp in the WHERE clause, the query engine will pretty quickly find the records it needs and stop looking.
There may still be some benefit from having fewer records, perhaps even from maintaining two tables, but duplicating data should be a last resort, used only if testing shows that the performance benefit is real and necessary.
In Redshift, the answer is yes: it is always quicker to query a smaller table than to use a WHERE clause on a larger table. This is because Redshift will generally scan all of the rows in the table, or at least those rows which are not excluded by the distribution/sort key optimisations.
Let's also address the other important aspects of this question.
In almost all cases Redshift storage is cheap - that is because storage is usually not the deciding factor when capacity planning a Redshift cluster. It is more about getting the performance you need for the queries that you want to run.
You can improve the performance of Redshift queries in 4 ways:
Increase the size of the cluster.
Tune the query.
Alter the definition of the Redshift tables, taking into account contents and usage patterns. Sort and distribution keys can make a big difference. Compression types should also be considered.
Implement Redshift performance management, to give priority to higher priority queries.

Performance of 100M Row Table (Oracle 11g)

We are designing a table for ad-hoc analysis that will capture umpteen value fields over time for claims received. The table structure is essentially (pseudo-ish-code):
table_huge (
    claim_key int not null,
    valuation_date_key int not null,
    value_1 some_number_type,
    value_2 some_number_type,
    constraint pk_huge primary key (claim_key, valuation_date_key)
)
All value fields are numeric. The requirements are: The table shall capture a minimum of 12 recent years (hopefully more) of incepted claims. Each claim shall have a valuation date for each month-end occurring between claim inception and the current date. Typical claim inception volumes range from 50k-100k per year.
Adding all this up I project a table with a row count on the order of 100 million, and could grow to as much as 500 million over years depending on the business's needs. The table will be rebuilt each month. Consumers will select only. Other than a monthly refresh, no updates, inserts or deletes will occur.
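The projection can be reproduced with a short Python calculation. It rests on assumptions that are not in the requirements: claims incept uniformly through the year, and a claim incepted m months ago contributes exactly m month-end rows. The function name `projected_rows` is hypothetical.

```python
def projected_rows(claims_per_year: int, years: int = 12) -> int:
    """Estimate total rows if every claim gets one row per month-end since inception."""
    months = years * 12
    # Sum of 1 + 2 + ... + months: rows contributed by one claim per inception month.
    month_end_rows = months * (months + 1) // 2
    # claims_per_year / 12 claims incept in each month (uniform-inception assumption).
    return claims_per_year * month_end_rows // 12


print(f"{projected_rows(50_000):,}")   # 43,500,000
print(f"{projected_rows(100_000):,}")  # 87,000,000
```

Under these assumptions, 12 years of history lands between roughly 44 and 87 million rows, which is consistent with "on the order of 100 million"; keeping more years pushes the count up quadratically.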
I am coming at this from the business (consumer) side, but I have an interest in mitigating the IT cost while preserving the analytical value of this table. We are not overwhelmingly concerned about quick returns from the Table, but will occasionally need to throw a couple dozen queries at it and get all results in a day or three.
For argument's sake, let's assume the technology stack is, I dunno, in the 80th percentile of modern hardware.
The questions I have are:
Is there a point at which the cost-to-benefit of indices becomes excessive, considering a low frequency of queries against high-volume tables?
Does the SO community have experience with +100M row tables and can offer tips on how to manage?
Do I leave the database technology problem to IT to solve or should I seriously consider curbing the business requirements (and why?)?
I know these are somewhat soft questions, and I hope readers appreciate this is not a proposition I can test before building.
Please let me know if any clarifications are needed. Thanks for reading!
First of all: Expect this to "just work" if leaving the tech problem to IT - especially if your budget allows for an "80% current" hardware level.
I do have experience with 200M+ rows in MySQL on entry-level and outdated hardware, and I was always positively surprised.
Some hints:
On monthly refresh, load the table without non-primary indices, then create them. Search for the sweet spot of how many parallel index creations work best. In a project with much less data (ca. 10M rows), this reduced load time by 70% compared to the naive "create table, then load data" approach.
Try to get a grip on the number and complexity of concurrent queries: This has influence on your hardware decisions (less concurrency=less IO, more CPU)
Assuming you have 20 numeric fields of 64 bits each, times 200M rows: if I calculate correctly, this is a payload of 32 GB. Trade cheap disks for 64 GB of RAM and never have an IO bottleneck.
Make sure you set the tablespace to read only.
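The payload figure in the hints above can be verified directly. This sketch counts only the 20 value fields; key columns, per-row overhead and indexes are not included, so the real footprint would be larger.

```python
FIELDS = 20
BYTES_PER_FIELD = 64 // 8  # a 64-bit numeric field occupies 8 bytes
ROWS = 200_000_000

payload_bytes = FIELDS * BYTES_PER_FIELD * ROWS
print(payload_bytes / 10**9)  # 32.0 (decimal gigabytes)
```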
You could consider anchor modeling approach to store changes only.
Considering that so many rows are expected to repeat (~95%), which brings the row count from 100M down to only 5M, most of your concerns go away. At this point it is mostly a cache consideration: if the whole table can somehow fit into cache, things happen fairly fast.
For "low" data volumes, the following structure is slower to query than a plain table; at some point (as data volume grows) it becomes faster. That point depends on several factors, but it may be easy to test. Take a look at this white-paper about anchor modeling -- see graphs on page 10.
In terms of anchor-modeling, it is equivalent to
The modeling tool has automatic code generation, but it seems that it currently fully supports only MS SQL Server, though there is ORACLE in the drop-down too. It can still be used as a code-helper.
In terms of supporting code, you will need (minimum)
Latest perspective view (auto-generated)
Point in time function (auto-generated)
Staging table from which this structure will be loaded (see tutorial for data-warehouse-loading)
Loading function, from staging table to the structure
Pruning functions for each attribute, to remove any repeating values
It is easy to create all this by following auto-generated-code patterns.
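As a rough picture of what the pruning and point-in-time pieces do, here is a Python sketch for a single attribute of a single claim. It is not the code the modeling tool generates; `prune_repeats` and `value_at` are hypothetical names, and the date keys are made-up YYYYMM integers.

```python
from bisect import bisect_right


def prune_repeats(history):
    """Keep only rows whose value differs from the previously kept value.

    history: list of (valuation_date_key, value) sorted by date key.
    """
    pruned = []
    for date_key, value in history:
        if not pruned or pruned[-1][1] != value:
            pruned.append((date_key, value))
    return pruned


def value_at(pruned, date_key):
    """Point-in-time lookup: the latest stored value on or before date_key."""
    keys = [k for k, _ in pruned]
    pos = bisect_right(keys, date_key)
    return pruned[pos - 1][1] if pos else None


monthly = [(201001, 100.0), (201002, 100.0), (201003, 100.0),
           (201004, 250.0), (201005, 250.0)]
changes = prune_repeats(monthly)
print(changes)                    # [(201001, 100.0), (201004, 250.0)]
print(value_at(changes, 201003))  # 100.0
print(value_at(changes, 200912))  # None
```

Five monthly rows collapse to two stored changes, and the value for any month can still be recovered, which is the trade that makes the structure smaller but slightly more work to query.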
With no ongoing updates/inserts, an index NEVER has negative performance consequences, only positive (by MANY orders of magnitude for tables of this size).
More critically, the schema is seriously flawed. What you want is
claim_key (fk->Claim.claim_key)
This is much more space-efficient as it stores only the values you actually have, and does not require schema changes when the number of values for a single row exceeds the number of columns you have allocated.
Using partitioning, and applying the partition key in every query that you run, will give further performance improvements.
In our company we solved a huge number of performance issues with partitioning.
One more design suggestion: if you know that the table is going to be very big, try not to put many constraints on the table and handle them in the application logic before you perform the operation, and don't put too many columns on the table, to avoid row chaining issues.

Is it possible to partition more than one way at a time in SQL Server?

I'm considering various ways to partition my data in SQL Server. One approach I'm looking at is to partition a particular huge table into 8 partitions, then within each of these partitions to partition on a different partition column. Is this even possible in SQL Server, or am I limited to defining one partition column+function+scheme per table?
I'm interested in the more general answer, but this strategy is one I'm considering for a Distributed Partitioned View, where I'd partition the data under the first scheme using DPV to distribute the huge amount of data over 8 machines, and then on each machine partition that portion of the full table on another partition key in order to be able to drop (for example) sub-partitions as required.
You are incorrect that the partitioning key cannot be computed. Use a computed, persisted column for the key:
ALTER TABLE MYTABLE ADD PartitionID AS ISNULL(Column1 * Column2,0) persisted
I do it all the time, very simple.
The DPV across a set of partitioned tables is your only clean option to achieve this: something like a DPV across tblSales2007, tblSales2008, tblSales2009, where each of the respective sales tables is partitioned again, possibly by a different key. There are some very good benefits in doing this in terms of operational resilience (one partitioned table going offline does not take the DPV down; it can still satisfy queries for the other timelines).
The hack option is to create an arbitrary hash of 2 columns, store it per record, and partition by it. You would have to generate this hash for every query, insertion, etc., since the partition key cannot be computed; it must be a stored value. It's a hack, and I suspect it would lose more performance than you would gain.
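For illustration, the "hash of 2 columns" idea might look like this in application code. The choice of CRC32, the separator, the bucket count of 8 and the name `partition_id` are all assumptions; the answer does not name a hash function.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical: matches the 8 partitions in the question


def partition_id(column1, column2, num_partitions: int = NUM_PARTITIONS) -> int:
    """Derive a stable partition number from two column values."""
    # CRC32 is used because it is deterministic across processes;
    # Python's built-in hash() of a str is salted per interpreter run.
    key = f"{column1}|{column2}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


ids = [partition_id(c1, c2) for c1 in range(100) for c2 in ("a", "b")]
print(min(ids), max(ids))  # always within 0..7
```

The same function must be applied on every insert and on every query that should target a single partition, which is the maintenance cost the answer warns about.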
You do have to be thinking of specific management issues / DR over data quantities though, if the data volumes are very large and you are accessing it in a primarily read mechanism then you should look into SQL 'Madison' which will scale enormously in both number of rows as well as overall size of data. But it really only suits the 99.9% read type data warehouse, it is not suitable for an OLTP.
I have production data sets sitting in the 'billions' bracket, and they reside on partitioned table systems and provide very good performance, although much of this is due to the hardware underlying the system, not the database itself. Scaling up to this level is not an issue, and I know of others who have gone well beyond those quantities as well.
The maximum number of partitions per table remains at 1000; from what I remember of a conversation about this, the figure was set by the testing performed, not by a technical limitation.
