Database Structure for hierarchical data with horizontal slices

We're currently trying to improve query performance for our site. The core hierarchical data structure has 5 levels, and each type has about 20 fields.
level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)
We have a single data pipeline which performs all of these updates and inserts (i.e. we have full control over the data going in).
The queries we need to do on this are:
fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children
We read from this data source a lot, as in hundreds of times a second. All of the queries we need to perform are known and optimised as well as they can be for the current data structure.
We're currently using MySQL queries behind memcached for this, and just doing additional queries to get children/parents. I'm thinking that some sort of tree-based or document-based database might be more suitable.
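For reference, a rough sketch of the kind of adjacency-list layout being described (table and column names are assumptions, not our actual schema):

CREATE TABLE level2 (
  id        INT UNSIGNED NOT NULL PRIMARY KEY,
  level1_id INT UNSIGNED NOT NULL,   -- parent pointer
  -- ~20 data fields omitted
  INDEX (level1_id)
);

CREATE TABLE level3 (
  id        INT UNSIGNED NOT NULL PRIMARY KEY,
  level2_id INT UNSIGNED NOT NULL,   -- parent pointer
  -- ~20 data fields omitted
  INDEX (level2_id)                  -- keeps "fetch children of a level2" index-backed
);
-- level1, level4 and level5 follow the same pattern

-- "fetch a single level3 item plus its parents" is then a short chain of joins:
SELECT l3.*, l2.*, l1.*
FROM level3 l3
JOIN level2 l2 ON l2.id = l3.level2_id
JOIN level1 l1 ON l1.id = l2.level1_id
WHERE l3.id = ?;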
My question is: what's the best way to model this data for efficient read performance?

Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.
I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.
OLAP servers:
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
Combining relational and OLAP:
http://en.wikipedia.org/wiki/HOLAP
Tableau:
http://www.tableausoftware.com/
*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.
If I've misunderstood the issue you're having, then by all means please ignore this answer :\
UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).
ODBMS info: http://en.wikipedia.org/wiki/Object_database
InterSystems Caché is one object database I know of that sounds like a more appropriate fit based on what you've said.
http://www.intersystems.com/cache/
If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).
In my case, on MS SQL, a switch we made in some core queries from a VARCHAR field to an INTEGER field made a huge difference in speed. Text data is one of the most expensive types of data to process. So for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.
An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.
The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
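As an illustrative sketch of that kind of change (assuming MS SQL syntax; the table and column names are made up):

-- Before: the join condition compares VARCHAR columns
SELECT p.*
FROM products p
INNER JOIN categories c ON c.name = p.category_name
WHERE c.name = 'Books';

-- After: the text lives once in a lookup table and the join uses an INTEGER key
CREATE TABLE categories (
  id   INT IDENTITY(1,1) PRIMARY KEY,
  name VARCHAR(100) NOT NULL UNIQUE
);

SELECT p.*
FROM products p
INNER JOIN categories c ON c.id = p.category_id   -- integer comparison, much cheaper
WHERE c.name = 'Books';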
And of course, in general you should be sure to have appropriate indexes on your data.
Hope that helps.

A document/tree-based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design? I fail to see any. Querying one level up or down doesn't count: that is a simple join. Please bear in mind that by going the document/tree-based database route you would compromise your general querying ability. To summarize: just hire a competent DB specialist to analyze your performance bottlenecks -- they are usually cured with mundane index additions.

There's not really enough info here to say much useful -- you'd need to measure things, look at EXPLAIN output, etc. -- but one option that goes beyond the usual indexing would be to shard by level3 instances. That would give you better performance on parallel queries that hit different shards, at its simplest (separate disks), or you could use separate machines if you want to throw more resources at each shard.
The only reason I mention this is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted -- I have no idea what tools MySQL has for this).
And if your data volume isn't too high, then with sharding you might be able to get it down onto SSDs...

Reaching an appropriate balance between performance and scalability in a large database

I'm trying to determine which of the many database models would best support probabilistic record comparison. Specifically, I have approximately 20 million documents defined by a variety of attributes (name, type, author, owner, etc.). Text attributes dominate the data set, yet there are still plenty of images. Read operations are the most crucial vis-a-vis performance, but I expect roughly 20,000 new documents to be inserted each week. Luckily, insert speed does not matter at all, and I am comfortable queuing the incoming documents for controlled processing.
Database queries will most typically take the following forms:
Find documents containing at least five sentences that reference someone who's a member of the military
Predict whether User A will comment on a specific document written by User B, given User A's entire comment history
Predict an author for Document X by comparing vocabulary, word ordering, sentence structure, and concept flow
My first thought was to use a simple document store like MongoDB, since each document does not necessarily contain the same data. However, complex queries effectively degrade this to a file system wrapper, since I cannot construct a query yielding the results I desire. As such, this approach corners me into walking the entire database and processing each file separately. Although document stores scale well horizontally, the benefits are not realized here.
This led me to realize that my granularity isn't at the document level, but rather at the entity-relationship level. As such, graph databases seemed like a logical choice, since they facilitate relating each word in a sentence to the next word, next paragraph, current paragraph, part of speech, etc. Graph databases limit data replication, increase the speed of statistical clustering, and scale horizontally, among other things. Unfortunately, ensuring a definitive answer to a query still necessitates traversing the entire graph. Still, indexing will help with performance.
I've also evaluated the use of relational databases, which are very efficient when designed properly (i.e., by avoiding unnecessary normalization). A relational database excels at finding all documents authored by User A, but fails at structural comparisons (which involves expensive joins). Relational databases also enforce constraints (primary keys, foreign keys, uniqueness, etc.) efficiently--a task with which some NoSQL solutions struggle.
After considering the above-listed requirements, are there any database models that combine the "exactness" of relational models (viz., efficient exhaustion of the domain) with the flexibility of graph databases?
This is not really an answer, just a discussion.
The database you are talking about is a large database. You don't mention the nature of the documents, but newspaper articles are typically in the 2-3k range, so you are talking about hundreds of gigabytes of raw data.
If query performance is an issue, you are talking about a large, rather expensive system.
Your requirements are also quite complex, and not likely to be met out of the box. I would be thinking of a hybrid system. Store the document metadata in a relational database system, so you can quickly access it with simple queries. You can store the documents themselves in the database as blobs.
Some of your requirements can be met with text add-ins on relational databases. So, simple searching is feasible using inverted index technology. That handles the first of your three scenarios.
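A minimal sketch of that hybrid layout, assuming PostgreSQL and its built-in full-text support purely for illustration (any engine with a text add-in would do; all names are made up):

CREATE TABLE documents (
  id        BIGSERIAL PRIMARY KEY,
  title     TEXT,
  author_id BIGINT,
  doc_type  TEXT,
  body      BYTEA,       -- the raw document stored as a blob
  body_tsv  TSVECTOR     -- inverted-index representation of the extracted text
);

CREATE INDEX documents_body_tsv_idx ON documents USING GIN (body_tsv);

-- Scenario 1 (simple text search) goes through the inverted index:
SELECT id, title
FROM documents
WHERE body_tsv @@ to_tsquery('english', 'military & member');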
The other two are much more challenging. The third ("predict an author") can probably be handled by having a parallel system that stores author information, summarized from the documents when they are loaded. Then it is a question of comparing a document to the author, using simple statistical analysis (naive Bayesian, anyone?).
The middle one is tricky, but it suggests yet another component for managing comments on documents. Depending on the volume, this may be easy or hard.
Finally, how changing are the requirements? Do you really know what the system should be doing? Or will the functionality be radically different once you get it up and running?

Fast read-only embedded "database"?

I'm looking to distribute some information to different machines for efficient and extremely fast access without any network overhead. The data exists in a relational schema, and it is a requirement to "join" on relations between entities, but it is not a requirement to write to the database at all (it will be generated offline).
I had a lot of confidence that SQLite would deliver on performance, but an RDBMS seems to be unsuitable at a fundamental level: joins are very expensive due to the cost of index lookups, and in my read-only context they are an unnecessary overhead, where entities could store direct references to each other in the form of file offsets. In this way, an index lookup is traded for a file seek.
What are my options here? Database doesn't really seem to describe what I'm looking for. I'm aware of Neo4j, but I can't embed Java in my app.
TIA!
Edit, to answer the comments:
The data will be up to 1 GB in size, and I'm using PHP, so keeping the data in memory is not really an option. I will rely on the OS buffer cache to avoid continually going to disk.
An example would be a Product table with 15 fields of mixed type, and a query to list products of a certain make, joining on a Category table.
The solution will have to be some kind of flat file. I'm wondering if there already exists some software that meets my needs.
#Mark Wilkins:
The performance problem is measured. Essentially, it is unacceptable in my situation to replace a 2ms IO-bound query to memcache with a 5ms CPU-bound call to SQLite... For example, the categories table has 500 records, containing parent and child categories. The following query takes ~8ms, with no disk IO: SELECT 1 FROM categories a INNER JOIN categories b ON b.id = a.parent_id. Some simpler, join-less queries are very fast.
I may not be completely clear on your goals as to the types of queries you are needing. But the part about storing file offsets to other data seems like it would be a very brittle solution that is hard to maintain and debug. There might be some tool that would help with it, but my suspicion is that you would end up writing most of it yourself. If someone else had to come along later and debug and figure out a homegrown file format, it would be more work.
However, my first thought is to wonder if the described performance problem is estimated at this point or actually measured. Have you run the tests with the data in a relational format to see how fast it actually is? It is true that a join will almost always involve more file reads (do the binary search as you mentioned and then get the associated record information and then lookup that record). This could take 4 or 5 or more disk operations ... at first. But in the categories table (from the OP), it could end up cached if it is commonly hit. This is a complete guess on my part, but in many situations the number of categories is relatively small. If that is the case here, the entire category table and its index may stay cached in memory by the OS and thus result in very fast joins.
If the performance is indeed a real problem, another possibility might be to denormalize the data. In the categories example, just duplicate the category value/name and store it with each product record. The database size will grow as a result, but you could still use an embedded database (there are a number of possibilities). If done judiciously, it could still be maintained reasonably well and provide the ability to read full object with one lookup/seek and one read.
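For example, a sketch of that denormalization against the tables mentioned in the question (column names are guesses):

-- Instead of joining products to categories at read time, copy the category
-- name onto each product row when the file is generated offline:
CREATE TABLE products_denorm (
  id            INTEGER PRIMARY KEY,
  make          TEXT,
  category_id   INTEGER,
  category_name TEXT        -- duplicated from categories.name
  -- ...remaining product fields...
);
CREATE INDEX products_denorm_make_idx ON products_denorm (make);

-- The listing query becomes a single-table, index-backed read:
SELECT * FROM products_denorm WHERE make = 'Acme';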
In general, probably the fastest thing you can do at first is to denormalize your data, thus avoiding JOINs and other multi-table lookups.
Using SQLite you can certainly customize all sorts of things and tailor them to your needs. For example, disable all mutexing if you're only accessing via one thread, up the memory cache size, customize indexes (including getting rid of many), do a custom build to disable unnecessary metadata, debugging, etc.
Take a look at the following:
PRAGMA Statements: http://www.sqlite.org/pragma.html
Custom Builds of SQLite: http://www.sqlite.org/custombuild.html
SQLite Query Planner: http://www.sqlite.org/optoverview.html
SQLite Compile Options: http://www.sqlite.org/compile.html
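For instance, a handful of the PRAGMAs from the pages above, with illustrative values (not tuned for your workload):

PRAGMA cache_size = -200000;   -- ~200 MB of page cache (a negative value means KiB)
PRAGMA temp_store = MEMORY;    -- keep temporary tables/indices in RAM
PRAGMA journal_mode = OFF;     -- no rollback journal needed for a read-only file
PRAGMA synchronous = OFF;      -- no durability concerns when nothing is written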
This is all of course assuming a database is what you need.

Speed, CouchDB views and alternatives

I have a large data set which I want to query. The query does not change, but the underlying data does. From what I read, I could construct a "view" and query it. Also, I read that CouchDB knows how to update the view when data is changed, so I assume querying the view again would still be fast.
My questions are: do I understand CouchDB's views correctly? I don't need any other feature of CouchDB, I don't even need SQL; all I want is the same fast query over changing data. Could I use something else? If I used, say, good old MySQL, would it really be slower than CouchDB (read: in the above scenario, how would various DBs perform, approximately)?
Your assessment is completely correct. Enjoy!
The only performance trick worth mentioning is that you may see a boost if you emit() all of the data you need from the view and never use the ?include_docs feature, because include_docs causes CouchDB to go back into the main database and retrieve the original doc that caused that view row. In other words, you can emit() everything you need into your view index (more space but faster), or you can use the reference back to the original document (less space but slower.)
I don't think anyone can answer your question given the information you have provided.
Indexes in a relational database are analogous to CouchDB views. In both cases, they store a pre-sorted instance of the data and the database keeps that instance in sync with the canonical data. Both types of database transparently use the index/view to speed up subsequent queries of the form that the index/view was designed for.
Without indexes/views, queries must scan the whole collection of n records of data, and they execute in O(n) time. When a query benefits from an index/view, it executes in O(log n) time.
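As a trivial sketch of the relational half of that analogy (table and column names are invented):

-- Without an index, this scans all n rows:
SELECT * FROM orders WHERE customer_id = 42;

-- The index is the pre-sorted copy the database keeps in sync with the table,
-- turning the same query into an O(log n) lookup:
CREATE INDEX orders_customer_id_idx ON orders (customer_id);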
But that's speaking very broadly of the performance curve with respect to the volume of data. A given database could have such speedy performance in certain cases that it out-performs another product no matter what. It's hard to make generalizations that brand X is always faster than brand Y. The only way to be sure about a specific case is to try that case in both databases and measure the performance.

Oracle Multiple Schemas Aggregate Real Time View

All,
Looking for some guidance on an Oracle design decision I am currently trying to evaluate:
The problem
I have data in three separate schemas on the same Oracle DB server. I am looking to build an application that will show data from all three schemas; however, the data that is shown will be based on real-time sorting and prioritisation rules that are applied to the data globally (i.e. based on the priority weightings applied, I may pull back data from any one of the three schemas).
Tentative Solution
Create a VIEW in the DB which maintains logical links to the relevant columns in the three schemas, write a stored procedure which accepts parameterised priority weightings. The application subsequently calls the stored procedure to select the ‘prioritised’ row from the view and then queries the associated schema directly for additional data based on the row returned.
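For concreteness, the kind of view and prioritised selection being described might look roughly like this (schema, table, and column names are placeholders):

CREATE OR REPLACE VIEW all_items_v AS
SELECT 'S1' AS source_schema, id, priority_attr, created_date FROM schema1.items
UNION ALL
SELECT 'S2', id, priority_attr, created_date FROM schema2.items
UNION ALL
SELECT 'S3', id, priority_attr, created_date FROM schema3.items;

-- The stored procedure then ranks against the parameterised weightings, e.g.:
SELECT *
FROM  (SELECT v.*,
              v.priority_attr * :w_priority
              + (SYSDATE - v.created_date) * :w_age AS score
       FROM   all_items_v v
       ORDER  BY score DESC)
WHERE ROWNUM = 1;   -- the highest-priority row across all three schemas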
I have concerns over performance where the data is being sorted/prioritised upon each query being performed, but cannot see a way around this as the prioritisation rules will change often. We are talking about data sets in the region of 2-3 million rows per schema.
Does anyone have alternative suggestions on how to provide an aggregated and sorted view over the data?
Querying from multiple schemas (or even multiple databases) is not really a big deal, even inside the same query. Just prefix the table name with the schema you are interested in, as in
SELECT SOMETHING
FROM
SCHEMA1.SOME_TABLE ST1, SCHEMA2.SOME_TABLE ST2
WHERE ST1.PK_FIELD = ST2.PK_FIELD
If performance becomes a problem, then that is a big topic... optimal query plans, indexes, and your method of database connection can all come into play. One thing that comes to mind is that if it does not have to be realtime, then you could use materialized views (aka "snapshots") to cache the data in a single place. Then you could query that with reasonable performance.
Just set the snapshots to refresh at an interval appropriate to your needs.
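For example, a materialized view along those lines (schema and column names are placeholders):

CREATE MATERIALIZED VIEW all_items_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE
  START WITH SYSDATE
  NEXT  SYSDATE + 15/(24*60)   -- refresh every 15 minutes; tune to your needs
AS
SELECT 'S1' AS source_schema, id, priority_attr, created_date FROM schema1.items
UNION ALL
SELECT 'S2', id, priority_attr, created_date FROM schema2.items
UNION ALL
SELECT 'S3', id, priority_attr, created_date FROM schema3.items;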
It doesn't matter that the data is from 3 schemas, really. What's important to know is how frequently the data will change, how often the criteria will change, and how frequently it will be queried.
If there is a finite set of criteria (that is, the data will be viewed in a limited number of ways) which only change every few days and it will be queried like crazy, you should probably look at materialized views.
If the criteria is nearly infinite, then there's no point making materialized views since they won't likely be reused. The same holds true if the criteria itself changes extremely frequently, the data in a materialized view wouldn't help in this case either.
The other question that's unanswered is how often the source data is updated, and how important it is to have the newest information. Frequently updated source data can either mean a materialized view will get "stale" for some duration, or you may spend a lot of time refreshing the materialized views unnecessarily to keep the data "fresh".
Honestly, 2-3 million records isn't a lot for Oracle anymore, given sufficient hardware. I would probably benchmark simple dynamic queries first before attempting fancy (materialized) views.
As others have said, querying a couple of million rows in Oracle is not really a problem, but then that depends on how often you are doing it - every tenth of a second may cause some load on the db server!
Without more details of your business requirements and a good model of your data, it's always difficult to provide good performance ideas. It usually comes down to coming up with a theory, then trying it against your database and assessing whether it is "fast enough".
It may also be worth taking a step back and asking yourself how accurate the results need to be. Does the business really need exact values for this query, or are good estimates acceptable?
Tom Kyte (of Ask Tom fame) always has some interesting ideas (and actual facts) in these areas. This article describes generating a proper dynamic search query - but Tom points out that when you query Google it never tries to get the exact number of hits for a query - it gives you a guess. If you can apply a good estimate then you can really improve query performance times

How to gain performance when maintaining historical and current data?

I want to maintain the last ten years of stock market data in a single table. Certain analyses need only the last month of data. When I do this short-term analysis it takes a long time to complete the operation.
To overcome this I created another table to hold the current year's data alone. When I perform the analysis from this table it is 20 times faster than the previous one.
Now my questions are:
Is a separate table the right way to handle this kind of problem (or should we use a separate database instead of a table)?
If I have a separate table, is there any way to update the secondary table automatically?
Or can we use something like a materialized view to gain performance?
Note: I'm using a PostgreSQL database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on nearly the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than year though; it would give you a greater degree of control. Just set up your partitions and then constrain them by month (or some other date range). In your postgresql.conf you'll need to turn constraint_exclusion = on to really get the benefit. The additional benefit here is that you can index only the exact tables you really want to pull information from. If you're batch importing large amounts of data into this table, you may get slightly better results with a rule than with a trigger, and for partitioning I find rules easier to maintain. But for smaller transactions, triggers are much faster. The PostgreSQL manual has a great section on partitioning via inheritance.
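A minimal sketch of the inheritance-based setup described above (table, column, and range names are examples):

CREATE TABLE stock_data (
  symbol     TEXT    NOT NULL,
  trade_date DATE    NOT NULL,
  price      NUMERIC,
  volume     BIGINT
);

-- One child table per month, constrained so the planner can skip it
CREATE TABLE stock_data_2010_01 (
  CHECK (trade_date >= DATE '2010-01-01' AND trade_date < DATE '2010-02-01')
) INHERITS (stock_data);
CREATE INDEX stock_data_2010_01_date_idx ON stock_data_2010_01 (trade_date);

-- Route inserts to the right partition (a trigger works here too, as noted above)
CREATE RULE stock_data_insert_2010_01 AS
ON INSERT TO stock_data
WHERE (NEW.trade_date >= DATE '2010-01-01' AND NEW.trade_date < DATE '2010-02-01')
DO INSTEAD INSERT INTO stock_data_2010_01 VALUES (NEW.*);

-- With constraint_exclusion = on, queries over the recent range only touch the
-- matching child table:
SELECT * FROM stock_data WHERE trade_date >= DATE '2010-01-01';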
I'm not sure about PostgreSQL, but I can confirm that you are on the right track. When dealing with large data volumes partitioning data into multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in Data Warehousing, and specifically in your case stock market data.
However, I'm curious why you need to update your historical data. If you're dealing with stock splits, it's common to implement that using a separate multiplier table that is used in conjunction with the raw historical data to give an accurate price/share.
It is perfectly sensible to use a separate table for historical records. It's much more problematic with a separate database, as it's not simple to write cross-database queries.
Automatic updates -- that's a job for a cron job.
You can use partial indexes for such things -- they do a wonderful job.
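For example, a partial index covering just the recent slice (names and the cutoff date are illustrative):

CREATE INDEX stock_data_recent_idx
ON stock_data (symbol, trade_date)
WHERE trade_date >= DATE '2010-01-01';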
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions) and your existing code will be faster (if you index properly) without modifying it.
Other measures such as partitioning come after that...
