I am planning to set up a database for some rudimentary aggregations. The plan is to provide queries like SELECT SUM(energy) WHERE... to our users.
energy is a plain numerical field. The WHERE clause is more interesting, because we expose some (limited) customizability to the user, basically just ANDing equality tests on a few fields, like A=3, or A=3 AND B=92, etc.
I'm no DBA, but my performance sense is tingling. As it stands, I'm anticipating a load of O(users × records) on the database, should the queries be fired off all at once. Is there a way to optimize for this?
If the WHERE condition were fixed, then we could simply provide a view or otherwise precalculate and cache the sum. Unfortunately, in this case, we will be providing limited ability to customize the WHERE expression, basically offering some fields for the users to AND at will.
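For a fixed filter, what I had in mind was something along these lines (a sketch only; the table and column names readings, a, b are made up):

    -- Works only if the filter never changes:
    CREATE MATERIALIZED VIEW cached_sum AS
    SELECT SUM(energy) AS energy_sum
    FROM readings
    WHERE a = 3 AND b = 92;

    -- User requests then read the cached value:
    SELECT energy_sum FROM cached_sum;

But with user-chosen AND combinations there would be one such view per possible filter, which doesn't seem workable.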
Looks to me like each of these aggregation queries would traverse basically the whole table, or significant subsections of the table, one traversal per user query. Does that make sense?
What are some ways to optimize for this kind of workflow? Should I just shard over many of my fields? I'm considering how many replicas to use, though I am unsure whether the number of replicas could outpace the growth of users or data, given that each aggregation query touches so much of the table.
In terms of low-level performance, would it make sense to structure these queries as SELECT SUM(energy)>N WHERE..., and hope that PostgreSQL is smart enough to terminate early when the subtotal is found to already exceed the threshold N?
Finally, would a NoSQL or TSDB offer advantages for this workflow, or would their performance be comparable to a SQL database?
Update
Since most of these queries will be run on a schedule, I think I will stagger them to spread the load throughout the day. But I'm still keen on finding better ways to optimize the tables for this load, should a bunch of active users suddenly submit aggregation queries all at once.
Related
I have read about anchor modelling from http://www.anchormodeling.com/ - there are a lot of publications that made sense to me. I am very concerned about the performance, though... storing so many records in a property table and always working with the most recent one seems like it would be heavy on memory and CPU. The authors claim that this is not the case, though... Is there any better modelling technique to store history and allow roll-back of records?
Normally querying data in an Anchor modeled database falls within two categories:
1. OLTP-like queries, retrieving a large number of attributes using high selectivity conditions
2. OLAP-like queries, retrieving a small number of attributes using low selectivity conditions
In (1) the high selectivity, often constraining the results to those belonging to a single instance, will quickly pinpoint the desired instance, followed by a small overhead due to the joins involved. The joins are, however, made on declared PK/FK relations over tables already sorted by the single integer key corresponding to the identity of the instance. In other words, in a 6NF model (which is what provides the most temporal features), it is not possible to create a physical implementation that would perform better. As a case example, the Swedish insurance company Länsförsäkringar has been running a real-time master data management system using Anchor since 2005, containing about 10 million engagements for 3 million customers, without performance issues. That being said, if extremely many queries are going to be run in parallel, the added overhead may become an issue.
In (2), since you are retrieving a small number of attributes, the number of joins is reduced. In addition, the selectivity introduced by the conditions makes the joins behave like indexes (provided you have a cost-based optimizer that uses column statistics). An optimal join order will be produced using the most selective condition first, so that intermediate result sets become as small as possible as early as possible with respect to the involved joins. As an additional benefit, the 6NF structure in Anchor maps directly onto the distribution mechanisms in massively parallel processing relational databases, providing the best possible distribution for ad-hoc querying. As a case example, avito.ru has a 55TB data warehouse built using Anchor on a 12-node Vertica cluster, running without performance issues. In fact, this solution outperformed many of the other solutions they tested, including NoSQL alternatives.
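To make the shape of the model concrete, here is a deliberately simplified sketch of one anchor with one historized attribute (illustrative names only, not the naming convention the Anchor modeling tools generate):

    -- Anchor: one row per instance, identified by a single surrogate key.
    CREATE TABLE customer_anchor (
        customer_id integer PRIMARY KEY
    );

    -- Historized attribute in 6NF: one row per value change.
    CREATE TABLE customer_name (
        customer_id integer   NOT NULL REFERENCES customer_anchor (customer_id),
        name        text      NOT NULL,
        changed_at  timestamp NOT NULL,
        PRIMARY KEY (customer_id, changed_at)
    );

    -- "Most recent value" query: the join runs over the integer key,
    -- keeping only the latest row per customer.
    SELECT a.customer_id, n.name
    FROM customer_anchor a
    JOIN customer_name n
      ON n.customer_id = a.customer_id
     AND n.changed_at = (SELECT MAX(n2.changed_at)
                         FROM customer_name n2
                         WHERE n2.customer_id = a.customer_id);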
As a conclusion, I would say that you cannot find a better modeling technique if you need to support temporality and flexibility. I have to point out though that I am one of the authors of the technique, although what I have said has been proven both in practice and theory, with scientific papers to back up the claims.
We're currently looking at trying to improve the performance of queries for our site. The core hierarchical data structure has 5 levels, and each type has about 20 fields.
level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)
We have a single data pipeline which performs all of these updates and inserts (i.e. we have full control over the data going in).
The queries we need to do on this are:
fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children
We read from this datasource a lot, as in hundreds of times a second. All of the queries we need to perform are known and optimised as well as they can be for the current data structure.
We're currently using MySQL queries behind memcached for this, just doing additional queries to get children/parents. I'm thinking that some sort of tree-based or document-based database might be more suitable.
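For reference, the individual lookups are just joins along the parent keys, along these lines (simplified; the real tables have ~20 fields each and the column names here are made up):

    -- Fetch a single level3 item together with its parents:
    SELECT l3.*, l2.name AS level2_name, l1.name AS level1_name
    FROM level3 l3
    JOIN level2 l2 ON l2.id = l3.level2_id
    JOIN level1 l1 ON l1.id = l2.level1_id
    WHERE l3.id = ?;

    -- Fetch a level3 item's children and grandchildren:
    SELECT l4.*, l5.*
    FROM level4 l4
    LEFT JOIN level5 l5 ON l5.level4_id = l4.id
    WHERE l4.level3_id = ?;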
My question is: what's the best way to model this data for efficient read performance?
Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.
I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.
OLAP servers:
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
Combining relational and OLAP:
http://en.wikipedia.org/wiki/HOLAP
Tableau:
http://www.tableausoftware.com/
Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.
If I've misunderstood the issue you're having, then by all means please ignore this answer :\
UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).
ODBMS info: http://en.wikipedia.org/wiki/Object_database
InterSystem's Cache is one Object Database I know of that sounds like a more appropriate fit based on what you've said.
http://www.intersystems.com/cache/
If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).
In my case, I know that on MS SQL, switching some core queries from using a VARCHAR field to using an INTEGER field made a huge difference in speed. Text data is one of THE MOST expensive types of data to process. So, for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.
An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.
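A rough sketch of that kind of split, with hypothetical table names:

    -- The text lives in one place; everything else carries a small integer key.
    CREATE TABLE person_name (
        name_id    INTEGER PRIMARY KEY,
        first_name VARCHAR(100) NOT NULL,
        last_name  VARCHAR(100) NOT NULL
    );

    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name_id   INTEGER NOT NULL REFERENCES person_name (name_id)
        -- ...other fields...
    );

    -- Joins now compare integers instead of VARCHARs:
    SELECT p.person_id, n.first_name, n.last_name
    FROM person p
    JOIN person_name n ON n.name_id = p.name_id;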
The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
And of course, in general you should be sure to have appropriate indexes on your data.
Hope that helps.
A document/tree-based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design? I fail to see any. Querying one level up or down doesn't count: that is a simple join. Please bear in mind that going the document/tree-based database route would compromise your general querying ability. To summarize, just hire a competent DB specialist who would analyze your performance bottlenecks -- they are usually cured with mundane index additions.
There's not really enough info here to say much useful - you'd need to measure things, look at "explains", etc. - but one option that goes beyond the usual indexing would be to shard by level3 instances. That would give you better performance on parallel queries that hit different shards, at its simplest (separate disks), or you could use separate machines if you want to throw more resources at each shard.
The only reason I mention this is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted - I have no idea what tools MySQL has for this).
And if your data volume isn't too high, then with sharding you might be able to fit each shard onto SSDs...
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against DB normalization, or any other standards/best practices? And what is the best way to do it: should every table have a companion table for aggregated data, should the aggregates be stored in the same table they summarize, and when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure there is a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table), which works like a view except that it is materialized from a query and you can schedule how often you want it to refresh, e.g. every 5 minutes. That table then stored the count values. (A rough sketch of the same idea in PostgreSQL terms follows at the end of this answer.)
2) In the software package itself I set information like that as a system variable. So in Apache you can set a system-wide variable and refresh it every 5 minutes. Then it's somewhat accurate, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, so it's been a while, but I think in PHP it was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15); // cached count plus a timeout value
Nonetheless, however you store that single value it saves you from running it with every request.
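For the MQT idea in (1), I don't have the DB2 DDL handy, but the same pattern in PostgreSQL terms is a materialized view refreshed on a schedule (the table and column names here are just illustrative):

    -- Same idea as a DB2 MQT: precompute the counts, refresh on a schedule.
    CREATE MATERIALIZED VIEW question_counts AS
    SELECT question_id, COUNT(*) AS answer_count
    FROM answers
    GROUP BY question_id;

    -- Run from cron (or any scheduler) every 5 minutes:
    REFRESH MATERIALIZED VIEW question_counts;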
I have a large data set which I want to query. The query does not change, but the underlying data does. From what I read, I could construct a "view" and query it. Also, I read that CouchDB knows how to update the view when the data is changed, so I assume querying the view again would still be fast.
My questions are: do I understand CouchDB's views correctly? I don't need any other feature of CouchDB, I don't even need SQL; all I want is the same fast query over changing data. Could I use something else? If I used, say, good old MySQL, would it really be slower than CouchDB (read: in the above scenario, how would various DBs approximately perform)?
Your assessment is completely correct. Enjoy!
The only performance trick worth mentioning is that you may see a boost if you emit() all of the data you need from the view and never use the ?include_docs feature, because include_docs causes CouchDB to go back into the main database and retrieve the original doc that caused that view row. In other words, you can emit() everything you need into your view index (more space but faster), or you can use the reference back to the original document (less space but slower.)
I don't think anyone can answer your question given the information you have provided.
Indexes in a relational database are analogous to CouchDB views. In both cases, they store a pre-sorted instance of the data and the database keeps that instance in sync with the canonical data. Both types of database transparently use the index/view to speed up subsequent queries of the form that the index/view was designed for.
Without an index/view, a query must scan the whole collection of n records and executes in O(n) time. When a query benefits from an index/view, it executes in O(log n) time.
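On the relational side, a minimal example of what that looks like (table and column names are made up; EXPLAIN shows whether the index is actually used):

    -- Without this index the query scans all n rows; with it, lookups are O(log n).
    CREATE INDEX idx_events_account ON events (account_id);

    -- Inspect the plan the optimizer chose:
    EXPLAIN SELECT COUNT(*) FROM events WHERE account_id = 42;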
But that's speaking very broadly of the performance curve with respect to the volume of data. A given database could have such speedy performance in certain cases that it out-performs another product no matter what. It's hard to make generalizations that brand X is always faster than brand Y. The only way to be sure about a specific case is to try that case in both databases and measure the performance.
I want to maintain the last ten years of stock market data in a single table. Certain analyses need only the last month of data. When I do this short-term analysis, it takes a long time to complete.
To overcome this I created another table to hold the current year's data alone. When I run the analysis against this table, it is 20 times faster than against the previous one.
Now my questions are:
Is a separate table the right way to handle this kind of problem? (Or should we use a separate database instead of a table?)
If I have a separate table, is there any way to update the secondary table automatically?
Or can we use something like a materialized view to gain performance?
Note: I'm using Postgresql database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on near the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than just year, though; it gives you a greater degree of control. Just set up your partitions and then constrain them by month (or some other date range). In your postgresql.conf you'll need to set constraint_exclusion = on to really get the benefit. The additional benefit here is that you can index only the exact tables you really want to pull information from. If you're batch-importing large amounts of data into this table, you may get slightly better results with a Rule than with a Trigger, and for partitioning I find rules easier to maintain. But for smaller transactions, triggers are much faster. The PostgreSQL manual has a great section on partitioning via inheritance.
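A stripped-down sketch of the inheritance approach from that manual section (table, column, and date-range names are placeholders):

    -- Parent table stays empty; each child holds one month of data.
    CREATE TABLE ticks (
        symbol     text,
        price      numeric,
        trade_date date
    );

    CREATE TABLE ticks_2013_01 (
        CHECK (trade_date >= DATE '2013-01-01' AND trade_date < DATE '2013-02-01')
    ) INHERITS (ticks);

    CREATE TABLE ticks_2013_02 (
        CHECK (trade_date >= DATE '2013-02-01' AND trade_date < DATE '2013-03-01')
    ) INHERITS (ticks);

    -- A rule or trigger on ticks routes inserts into the right child table.
    -- With constraint_exclusion = on, a one-month query only touches the
    -- matching child:
    SELECT SUM(price) FROM ticks
    WHERE trade_date >= DATE '2013-01-01' AND trade_date < DATE '2013-02-01';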
I'm not sure about PostgreSQL, but I can confirm that you are on the right track. When dealing with large data volumes partitioning data into multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in Data Warehousing, and specifically in your case stock market data.
However, I'm curious why you need to update your historical data. If you're dealing with stock splits, it's common to implement that using a separate multiplier table that is used in conjunction with the raw historical data to give an accurate price/share.
It is perfectly sensible to use a separate table for historical records. It's much more problematic with a separate database, as it's not simple to write cross-database queries.
Automatic updates - that's a job for a cron job.
You can use partial indexes for such things - they do a wonderful job.
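For example, a partial index covering only the recent slice (the table name and cutoff date are placeholders); note that queries need a matching date predicate for the planner to use it:

    -- Index only the rows the short-term analysis touches.
    CREATE INDEX idx_ticks_recent
        ON ticks (symbol, trade_date)
        WHERE trade_date >= DATE '2013-01-01';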
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions) and your existing code will be faster (if you index properly) without modifying it.
Other measures such as partitioning come after that...