I have a medium-sized table, 50M records or so, capturing all property sales in a geographic region. The initial thinking was to have a composite (multi-column) index on the heavily queried fields: date (day precision), latitude (high precision), longitude (high precision) and price. Typical queries provide range values for all of these columns. I am really struggling to understand how range queries work on multiple columns of numeric type (lat/long in this case). Our data has a lot of unique values for lat & long, and in my mind there would be huge fanout in the index: just imagine thousands of unique values for both latitude and longitude under every date.
My question is: have DB indices come a long way, and can they handle this much better than my understanding of the problem suggests? If not, I would think the index could be much more performant if I introduced coarse values for date (e.g. month) and lat & long (maybe their integer parts) to minimize the fanout. The query would have a more verbose WHERE clause (e.g. month >= A AND month < B AND date >= C AND date < D). The index would filter on month, and the db would further filter on date (unindexed). Again, just wondering if there is merit in this approach, or if the latest indices handle this by default.
We did try the index on the fine level values as mentioned in the description, hoping to realize better performance than what we are seeing.
I created a few test indices to try to produce different scenarios. At least anecdotally, for the queries I'm trying to execute, the db selects the fine-grained indices. I was hoping that by making the keys more coarse-grained the range scan could be optimized across the first few keys, but that doesn't seem to happen. From my test data, if the first key filters out most of the data, the db can further filter records as it traverses the index on that first key. There's no real benefit in trying to optimize this further; at least I couldn't produce a better-performing index.
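This behaviour can be reproduced in miniature with SQLite (the table and column names here are made up for illustration; the engine in the question may differ, but composite B-tree indices work the same way: only the leading column's range drives the seek, and the remaining columns act as in-index filters):

```python
import sqlite3

# Hypothetical miniature of the sales table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    sale_date TEXT, lat REAL, lon REAL, price REAL)""")
conn.execute("CREATE INDEX idx_fine ON sales (sale_date, lat, lon, price)")

# A range predicate on every column: the B-tree can only seek on the
# leading column's range; lat/lon/price are filtered while scanning.
plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM sales
    WHERE sale_date BETWEEN '2020-01-01' AND '2020-06-30'
      AND lat BETWEEN 40.0 AND 41.0
      AND lon BETWEEN -74.5 AND -73.5
      AND price BETWEEN 100000 AND 500000""").fetchall()
for row in plan:
    print(row[-1])  # the plan mentions only the sale_date bounds as seek keys
```

The plan output shows the index being searched with bounds on sale_date alone, which matches the asker's observation: the first key does the heavy lifting and the rest is in-index filtering.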
Related
I have a table where my queries will be based purely on the id and created_time. There are 50 other columns, all of which will also be queried purely by id and created_time. I can design it in two ways:
Multiple small tables with 5 columns each, covering all 50 parameters
A single table with all 50 columns, with id and created_at as the primary key
Which will be better? My row count will increase tremendously, so should I worry about the width of the column family while modelling?
Actually, you need small tables to decrease the load on a single table, and you should also try to maintain query-based tables. If your read queries always fetch all 50 columns, then you can proceed with a single table. But if you plan to fetch only part of the data in each query, then you should maintain query-based small tables, which will redistribute the data evenly across the nodes, or maintain multiple partitions as Alex suggested (but then you cannot do range-based queries).
This really depends on how you structure your partition key and how data is distributed inside partitions. CQL has some limits, like a max of 2 billion cells per partition, but this is a theoretical limit; the practical limits are something like not having partitions bigger than 100Mb, etc. (DSE has recommendations in its planning guide).
If you'll always search by id & created_time, and are not doing range queries on created_time, then you may even make a composite partition key comprising both - this will distribute data more evenly across the cluster. Otherwise, make sure that you don't have too much data inside partitions.
Or you can add another piece to the partition key; for example, sometimes people add a truncated date-time into the partition key, e.g. the time rounded to the hour or to the day - but this will affect your queries. It really depends on them.
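The truncation itself is trivial; a small Python sketch of the kind of bucketing function this pattern uses (function names are illustrative, not from any Cassandra driver):

```python
from datetime import datetime, timezone

def bucket_hour(ts: datetime) -> datetime:
    """Truncate a timestamp to the hour, for use as an extra
    piece of a composite partition key."""
    return ts.replace(minute=0, second=0, microsecond=0)

def bucket_day(ts: datetime) -> datetime:
    """Truncate a timestamp to the day."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

ts = datetime(2018, 3, 7, 14, 35, 22, tzinfo=timezone.utc)
print(bucket_hour(ts))  # 2018-03-07 14:00:00+00:00
print(bucket_day(ts))   # 2018-03-07 00:00:00+00:00
```

The trade-off mentioned above follows directly: every read must now compute the same bucket(s) to know which partitions to query, and a time range spanning several buckets becomes several queries.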
Sort of in line with what Alex mentions, the determining factor here is going to be the size of your various partitions (which is an extension of the size of your columns).
Practically speaking, you can have problems going both ways - partitions that are too narrow can be as problematic as partitions that are too wide, so this is the type of thing you may want to try benchmarking and seeing which works best. I suspect for normal data models (staying away from the pathological edge cases), either will work just fine, and you won't see a meaningful difference (assuming 3.11).
In 3.11.x, Cassandra does a better job of skipping unrequested values than in 3.0.x, so if you do choose to join it all in one table, do consider using 3.11.2 or whatever the latest available release is in the 3.11 (or newer) branch.
I'm looking for a way to store data with a timestamp.
Each timestamp might have 1 to 10 data fields.
Can I store data as (time, key, value) using a simple SQL solution? How would that compare to a NoSQL solution like Mongo, where I can store {time: ..., key1: ..., key2: ...}?
It will store about 10 data points with a max of around 10 fields per second. The data might be collected for as long as 10 years, easily accumulating a billion records. The database should be able to support graphing data with time-range queries.
It should be able to handle a heavy write frequency, ~100 per second (ok, this is not that high, but still..), while at the same time handling queries that return about a million records (maybe even more).
The data itself is very simple; they are just electronic measurements. Some need to be measured with a high frequency (~100 milliseconds), and others every 1 min or so.
Can anyone who used something like this comment on the pluses and minuses of the method they used?
(Obviously this is a very specific scenario, so this definitely is not intended to turn in to what's the best database kind of question).
Sample data:
{ _id: Date(2013-05-08 18:48:40.078554),
  V_in: 2.44,
  I_in: 0.00988,
  I_max: 0.11,
},
{ _id: Date(2013-05-08 18:48:40.078325),
  I_max: 0.100,
},
{ _id: Date(2001-08-09 23:48:43.083454),
  V_out: 2.44,
  I_in: 0.00988,
  I_max: 0.11,
},
Thank you.
For simplicity, I would just make a table of timestamps with a column for each measurement point. An integer primary key would be technically redundant, since the timestamp uniquely identifies a measurement point; however, it's easier to refer to a particular row by number than by timestamp. You will have NULLs for any measured parameter that was not taken at that timestamp, which will take up a few extra bits per row (log base 2 of the number of columns, rounded up), but you also won't have to do any joins. It is true that this requires a schema change if you decide you want to add columns later, but that's really not too difficult, and you could just make another separate table that keys on this one.
Please see here for an example with your data: http://www.sqlfiddle.com/#!2/e967c/4
I would recommend making some dummy databases of large size to make sure whatever structure you use still performs adequately.
The (time,key,value) suggestion smells like EAV, which I would avoid if you're planning on scaling.
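The wide-table layout recommended above can be sketched in a few lines with SQLite (the column names are taken from the sample data; everything else is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One column per measured parameter; NULL where a parameter was not
# sampled at that timestamp (the wide-table layout described above).
conn.execute("""CREATE TABLE measurements (
    ts TEXT PRIMARY KEY,
    v_in REAL, v_out REAL, i_in REAL, i_max REAL)""")
rows = [
    ("2013-05-08 18:48:40.078554", 2.44, None, 0.00988, 0.11),
    ("2013-05-08 18:48:40.078325", None, None, None, 0.100),
]
conn.executemany("INSERT INTO measurements VALUES (?,?,?,?,?)", rows)

# Time-range query for graphing; the PRIMARY KEY on ts provides the index,
# and ISO-8601 strings compare correctly as text.
hits = conn.execute("""SELECT ts, i_max FROM measurements
    WHERE ts >= '2013-05-08' AND ts < '2013-05-09'
    ORDER BY ts""").fetchall()
print(len(hits))  # 2
```

Each row costs a few NULLs when only some parameters were sampled, but there are no joins and the time-range query stays a single index scan.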
We are designing a table for ad-hoc analysis that will capture umpteen value fields over time for claims received. The table structure is essentially (pseudo-ish-code):
table_huge (
claim_key int not null,
valuation_date_key int not null,
value_1 some_number_type,
value_2 some_number_type,
[etc...],
constraint pk_huge primary key (claim_key, valuation_date_key)
);
All value fields are numeric. The requirements are: the table shall capture a minimum of 12 recent years (hopefully more) of incepted claims. Each claim shall have a valuation date for each month-end occurring between claim inception and the current date. Typical claim inception volumes range from 50k-100k per year.
Adding all this up I project a table with a row count on the order of 100 million, and could grow to as much as 500 million over years depending on the business's needs. The table will be rebuilt each month. Consumers will select only. Other than a monthly refresh, no updates, inserts or deletes will occur.
I am coming at this from the business (consumer) side, but I have an interest in mitigating the IT cost while preserving the analytical value of this table. We are not overwhelmingly concerned about quick returns from the Table, but will occasionally need to throw a couple dozen queries at it and get all results in a day or three.
For argument's sake, let's assume the technology stack is, I dunno, in the 80th percentile of modern hardware.
The questions I have are:
Is there a point at which the cost-to-benefit of indices becomes excessive, considering a low frequency of queries against high-volume tables?
Does the SO community have experience with 100M+ row tables, and can it offer tips on how to manage them?
Do I leave the database technology problem to IT to solve, or should I seriously consider curbing the business requirements (and why?)?
I know these are somewhat soft questions, and I hope readers appreciate this is not a proposition I can test before building.
Please let me know if any clarifications are needed. Thanks for reading!
First of all: expect this to "just work" if you leave the tech problem to IT - especially if your budget allows for "80th percentile" hardware.
I do have experience with 200M+ rows in MySQL on entry-level and outdated hardware, and I was always positively surprised.
Some Hints:
On the monthly refresh, load the table without its non-primary indices, then create them. Search for the sweet spot: how many index creations in parallel work best. In a project with much less data (ca. 10M rows) this reduced load time by 70% compared to the naive approach of creating the table with its indices and then loading the data.
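As an illustrative sketch of that load pattern (SQLite stands in for MySQL here; the actual speedup depends on the engine and the data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Monthly refresh pattern: start from a bare table with no secondary indices.
conn.execute("""CREATE TABLE claims (
    claim_key INT, valuation_date_key INT, value_1 REAL)""")

# Bulk-load first; no B-tree maintenance happens per insert.
conn.executemany(
    "INSERT INTO claims VALUES (?,?,?)",
    ((k, 202401, k * 0.5) for k in range(100_000)),
)

# ...then build the non-primary indices in one pass over the loaded data,
# which is where the 70% load-time reduction mentioned above came from.
conn.execute("CREATE INDEX idx_valuation ON claims (valuation_date_key)")
```

With several indices, this is also the stage where one would experiment with how many index builds to run in parallel.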
Try to get a grip on the number and complexity of concurrent queries: this influences your hardware decisions (less concurrency = less IO, more CPU).
Assuming you have 20 numeric fields of 64 bits each, times 200M rows: if I calculate correctly, this is a payload of 32GB. Trade cheap disks for 64GB of RAM and never have an IO bottleneck.
Make sure you set the tablespace to read-only.
You could consider the anchor modeling approach to store changes only. Considering that there are so many expected repeated rows (~95%), bringing the row count from 100M down to only 5M removes most of your concerns. At that point it is mostly a cache consideration: if the whole table can somehow fit into cache, things happen fairly fast.
For "low" data volumes, the following structure is slower to query than a plain table; at one point (as data volume grows) it becomes faster. That point depends on several factors, but it may be easy to test. Take a look at this white-paper about anchor modeling -- see graphs on page 10.
In terms of anchor modeling, the equivalent structure is an anchor for the claim/valuation pair with one attribute per value field.
The modeling tool has automatic code generation, but it seems that it currently fully supports only MS SQL Server, though there is ORACLE in the drop-down too. It can still be used as a code helper.
In terms of supporting code, you will need (at minimum):
Latest perspective view (auto-generated)
Point in time function (auto-generated)
Staging table from which this structure will be loaded (see tutorial for data-warehouse-loading)
Loading function, from staging table to the structure
Pruning functions for each attribute, to remove any repeating values
It is easy to create all this by following auto-generated-code patterns.
With no ongoing updates/inserts, an index NEVER has negative performance consequences, only positive (by MANY orders of magnitude for tables of this size).
More critically, the schema is seriously flawed. What you want is
Claim
claim_key
valuation_date
ClaimValue
claim_key (fk->Claim.claim_key)
value_key
value
This is much more space-efficient as it stores only the values you actually have, and does not require schema changes when the number of values for a single row exceeds the number of columns you have allocated.
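A minimal sketch of this key/value layout, using SQLite and hypothetical value names (the original schema gives no column list, so "paid" and "reserve" here are invented examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Claim (
    claim_key       INTEGER PRIMARY KEY,
    valuation_date  TEXT);
CREATE TABLE ClaimValue (
    claim_key  INTEGER REFERENCES Claim (claim_key),
    value_key  TEXT,
    value      REAL,
    PRIMARY KEY (claim_key, value_key));
""")

conn.execute("INSERT INTO Claim VALUES (1, '2024-01-31')")
# Only the values that actually exist get a row -- no NULL padding,
# and adding a new kind of value needs no schema change.
conn.executemany("INSERT INTO ClaimValue VALUES (1, ?, ?)",
                 [("paid", 1200.0), ("reserve", 300.0)])
```

The trade-off, as the anchor-modeling answer above also implies, is that reassembling one "wide" row for analysis now requires a join (or pivot) per value.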
Using the partitioning concept, and applying the partition key in every query you perform, will give better performance.
In our company we solved a huge number of performance issues with partitioning.
One more design suggestion: if you know the table is going to be very, very big, try not to apply many constraints on the table (handle them in application logic before you write instead), and don't have too many columns on the table, to avoid row-chaining issues.
What is the best way to store DateTime in SQL Server to provide maximum search speed for large table? The table contains the records, and one row has to contain the date and time.
The searches are like
Value > '2008-1-1' AND Value < '2009-1-1'
or
Value > '2008-1-1' AND Value < '2008-1-31'
etc.
Which is best? A DateTime with an index? A Unix timestamp in a long with an index? Multiple int fields like year, month, day, etc.? Or something else?
I'd use the smallest datatype that supports the level of datetime accuracy you need.
e.g.
datetime2 if you need high accuracy down to 100 nanoseconds (6-8 bytes) (SQL 2008+)
datetime for accuracy to 3.33ms (8 bytes)
smalldatetime for accuracy to the minute (4 bytes)
date for accuracy to the day (no time stored, 3 bytes) (SQL 2008+)
You don't mention how large a table you are talking about. But there are strategies for improving query performance beyond the standard indexing strategy (e.g. partitioning, filtered indices).
Your example shows only date and no time element.
If you only need date then use the relatively new DATE type instead of DATETIME.
It is smaller and with an index should be fast.
If you choose a smaller data type, you can store more records on a single page, so there will be fewer IO operations and it will work faster.
Of course, introducing indexes may improve performance. Indexes should include as few columns as possible, to store the maximum number of records on a single page.
But...
Premature optimization is the root of all evil
First of all, you should store the date as a date, considering the precision you need. If you later face performance issues, investigate which queries are slow and maybe introduce some indexes, but make sure those indexes cover your queries.
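One detail worth flagging from the question's second example: a half-open range (>= start, < next period) sidesteps invalid end-dates like '2008-31-1' and end-of-month off-by-ones entirely. A small SQLite sketch, with ISO-8601 strings standing in for a proper date type:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (value TEXT)")  # ISO-8601 sorts correctly as text
conn.execute("CREATE INDEX idx_value ON events (value)")

# One row per day for two years starting 2008-01-01.
start = date(2008, 1, 1)
conn.executemany("INSERT INTO events VALUES (?)",
                 (((start + timedelta(days=i)).isoformat(),) for i in range(730)))

# Half-open range: everything in January 2008, no last-day arithmetic needed.
jan = conn.execute("""SELECT COUNT(*) FROM events
    WHERE value >= '2008-01-01' AND value < '2008-02-01'""").fetchone()[0]
print(jan)  # 31
```

The same >= / < pattern works unchanged for DATE, DATETIME2 or any other ordered type, and both bounds are sargable against the index.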
I have a scenario in which there's a huge amount of status data about an item.
The item's status is updated minute by minute, and there will be about 50,000 items in the near future. So, in one month, there will be about 2,232,000,000 rows of data. I must keep at least 3 months in the main table before archiving older data.
I must plan to achieve quick queries, based on a specific item (its ID) and a data range (usually, up to one month range) - e.g. select A, B, C from Table where ItemID = 3000 and Date between '2010-10-01' and '2010-10-31 23:59:59.999'
So my question is how to design a partitioning structure to achieve that?
Currently, I'm partitioning based on the "item's unique identifier" (an int) mod "the number of partitions", so that all partitions are equally distributed. But this has the drawback of keeping one additional column on the table to act as the partition column for the partition function, mapping each row to its partition. All that adds a little extra storage. Also, each partition is mapped to a different filegroup.
Partitioning is never done for query performance. With partitioning the performance will always be worse, the best you can hope for is no big regression, but never improvement.
For query performance, anything a partition can do, an index can do better, and that should be your answer: index appropriately.
Partitioning is useful for IO path control cases (distribute on archive/current volumes) or for fast switch-in switch-out scenarios in ETL loads. So I would understand if you had a sliding window and partition by date so you can quickly switch out the data that is no longer needed to be retained.
Another narrow case for partitioning is last page insert latch contention, like described in Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads.
Your partition scheme and use case do not seem to fit any of the scenarios in which partitioning would be beneficial (maybe the last one, but that is not clear from the description), so most likely it hurts performance.
I do not really agree with Remus Rusanu. I think partitioning may improve performance if there's a logical reason for it (related to your use cases). My guess is that you could partition on the itemID only. The alternative would be to use the date as well, but if you cannot guarantee that every query's date range stays within a single partition (not all queries are sure to cover only a single month), then I would stick to itemId partitioning.
If there are only a few items you need to compute, another option is to have a covering index: define an INDEX on you main differentiation field (the itemId) which INCLUDEs the fields you need to compute.
CREATE INDEX idxTest ON [Table] (ItemID) INCLUDE (quantity);
Applicative partitioning actually CAN be beneficial for query performance. In your case you have 50K items and 2G rows. You could for example create 500 tables, each named status_nnn where nnn is between 001 and 500 and "partition" your item statuses equally among these tables, where nnn is a function of the item id. This way, given an item id, you can limit your search a priori to 0.2% of the whole data (ca. 4M rows).
This approach has a lot of disadvantages, as you'll probably have to deal with dynamic SQL and other unpleasant issues, especially if you need to aggregate data from different tables. BUT it will definitely improve performance for certain queries, such as the ones you mention.
Essentially applicative partitioning is similar to creating a very wide and flat index, optimized for very specific queries w/o duplicating the data.
Another benefit of applicative partitioning is that you could in theory (depending on your use case) distribute your data among different databases and even different servers. Again, this depends very much on your specific requirements, but I've seen and worked with huge data sets (billions of rows) where applicative partitioning worked very well.
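The table-per-bucket mapping can be sketched as follows (the status_nnn naming follows the answer's convention; the column names and the 3-table demo subset are illustrative):

```python
import sqlite3

N_PARTITIONS = 500

def status_table(item_id: int) -> str:
    """Map an item id to one of 500 'partition' tables, status_001..status_500."""
    return f"status_{item_id % N_PARTITIONS + 1:03d}"

conn = sqlite3.connect(":memory:")
# Create just the partitions this demo touches (a real setup creates all 500).
for t in {status_table(i) for i in (3000, 3001, 3500)}:
    # Table name comes from our own function, never from user input,
    # so the f-string here is safe for this sketch.
    conn.execute(f"CREATE TABLE {t} (item_id INT, ts TEXT, a REAL)")

t = status_table(3000)  # 3000 % 500 == 0, so this is 'status_001'
conn.execute(f"INSERT INTO {t} VALUES (3000, '2010-10-15', 1.5)")

# A query for one item only ever touches its own table, i.e. ~0.2% of the data.
rows = conn.execute(
    f"SELECT * FROM {t} WHERE item_id = ? AND ts BETWEEN ? AND ?",
    (3000, "2010-10-01", "2010-10-31")).fetchall()
```

The dynamic-SQL cost the paragraph above mentions is visible even here: every statement has to interpolate the table name, and any cross-item aggregate would need a UNION over many tables.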