What is CPBTree in SAP HANA? - database

I'm studying the SAP HANA main-memory database.
There is an index type called CPBTree in it. Its documentation describes it as follows:
CPB+-tree stands for Compressed Prefix B+-Tree; this index tree type
is based on pkB-tree. CPB+-tree is a very small index because it uses
'partial key' that is only part of full key in index nodes.
This is a bit vague. There is no other explanation about CPBTree structure on the Internet.
Is there anyone who can explain more or introduce a good document?

Where to begin here?
B-trees are very intensely studied and developed data structures, so pointing to a single document that explains all aspects relevant to this question and SAP HANA is a bit difficult.
Maybe it helps to unpack the term first:
Compressed Prefix
This basically means, the B-tree index and leaf nodes do not contain the full strings for keys. Instead, the parts of the key-strings that are common among the keys (the prefixes) are stored separately. The leaf and index nodes then only contain
the pointer to the prefix
a sort of "delta" that contains the remaining key (this is where the partial key from the pkB-tree comes in)
and a pointer to the data record (row id)
This technique is rather common in many DBMS, usually attached to a feature called "index compression" or something similar.
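To make the idea a bit more tangible, here is a minimal Python sketch of prefix compression for one node of sorted keys. It is an illustration only and not HANA's actual CPB+-tree layout: the function names, the dictionary-based node, and the choice of one shared prefix per node are all assumptions made for the example.

```python
import bisect
import os


def compress_node(sorted_keys, row_ids):
    """Store the prefix shared by all keys once; keep only the remainders."""
    prefix = os.path.commonprefix(sorted_keys)
    suffixes = [key[len(prefix):] for key in sorted_keys]
    return {"prefix": prefix, "suffixes": suffixes, "row_ids": list(row_ids)}


def lookup(node, key):
    """Return the row id for key, or None if the key is not in this node."""
    prefix = node["prefix"]
    if not key.startswith(prefix):
        return None
    suffix = key[len(prefix):]
    # The suffixes keep the sort order of the full keys, so binary search works.
    pos = bisect.bisect_left(node["suffixes"], suffix)
    if pos < len(node["suffixes"]) and node["suffixes"][pos] == suffix:
        return node["row_ids"][pos]
    return None


# Hypothetical keys and row ids, purely for illustration.
keys = ["customer_1001", "customer_1002", "customer_1017", "customer_2040"]
node = compress_node(keys, [7, 3, 9, 5])

raw_chars = sum(len(key) for key in keys)
compressed_chars = len(node["prefix"]) + sum(len(s) for s in node["suffixes"])
```

In this toy example the key material shrinks from `raw_chars` to `compressed_chars` characters, which is the effect that lets more of the index fit into the CPU caches.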
So, now we know that HANA uses compressed B-tree indexes (for row-store tables and for data that can be expressed as strings).
Why is this important for an in-memory database like HANA?
In short: memory transfer effort between RAM and CPU.
The smaller the index structure, the more of it can fit into the CPU caches. To traverse (go through) the index, fewer back-and-forth movements of data have to be performed.
It's a huge performance advantage.
This is complemented with specific "cache-conscious" index protocols (how the index structure is used by the HANA kernel) that try to minimize the RAM-CPU data transfers.
All this is an overly simplified explanation and I hope that it helps to make more sense nevertheless.
If you want to "dive deeper" and start reading academic papers around that topic, then Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems by Prof. Sang K. Cha et al. is a good place to start.
This is the same Sang K. Cha that created P*Time, an in-memory (row-store) DBMS in the early 2000s.
P*Time was, as is rather well known, acquired by SAP (like so many other DBMS software companies... Sybase... MaxDB... OrientDB...) and the technology was used as a research base for what would become SAP HANA.
Nowadays, there is only a small part of P*Time still in SAP HANA and it is mostly reduced to the concepts and algorithms and not so much expressed in actual P*Time code.
All in all, for the user of HANA (developer, admin, data consumer) the specifics of this index implementation hardly matter as none of them can interact with the index structure directly.
What matters is that this index takes modern server systems (many cores, large CPU caches, lots of RAM) and extracts great performance from them, while still allowing for "high-speed" transactions.
I added an extended write-up of this answer to my blog: https://lbreddemann.org/what-is-cpb-tree-in-sap-hana/.

Related

Relational Database schema design for metric storage

Considering a system that has the following characteristics:
Stores time series data/metrics collected from multiple sensors/inputs.
Data points (metrics) are collected from many different systems at different times.
Each of these metrics is generally one data point (e.g. temp and humidity are not reported at the same time, but rather individually and will have a different timestamp)
The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and cpu, tomorrow a sensor may be added that monitors co2 and RAM).
A summary of all metrics for a given time bucket needs to be obtained via a query, and this is likely to be the most common querying scenario.
I can think of three ways of modeling this.
1. Wide table - with table per category (covered)
Notes: Has lots of sparse values because the data points are collected individually. Storing a new metric requires a new column.
2. Narrow table - with table per metric (covered)
Notes: Storing a new metric requires a new table.
3. Typed table (not covered) - with single metric table (not covered)
Notes: Storing a new metric just requires a new row in the metricType table, with no schema changes. I am concerned about performance implications due to chunk size, although grouping by a time bucket across all metrics would not require joins and could therefore be faster?
I was wondering if anyone could comment on the options presented, point me to some performance benchmarks that include 3 as well as 1 and 2, or generally give any advice on the suitability of each approach. I'm planning to run my own experiments on this and I will post the results when done, but any insight at this stage would be gratefully received. :)
Please note, do not suggest a nosql solution, I'm aware of the options in that space and am assessing that option separately
1 Proposal
"Wide table"
That has gross Normalisation errors (as well as, if taken seriously, masses of Nulls and integrity problems). It is unusable, no further comment is required.
"Narrow table"
That is free of errors, but the Normalisation is not yet complete.
"Typed table"
That is sort of complete, the "best" of your three scenarios. But it views the issue through a narrow lens, and in total isolation from the context in which the issue exists. Thus it is in error for reasons other than those you inquire about.
2 Problem
The first problem is that you are comparing three things which are not reasonably comparable, not reasonably equal to each other.
The second problem is, EAV is the flavour of the month, and many people are attracted to it. However, it has major problems, and requires an additional set of "metadata" tables if it is to be implemented with some data integrity. The point is, EAV is not needed.
3 Solution
The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and cpu, tomorrow a sensor may be added that monitors co2 and RAM).
This is actually a straight-forward Relational database problem, which is solved by a perfectly ordinary Relational design, which provides full Relation Power; Relational Integrity; and Speed (which other designs will not have).
3.1 Caveat
But there are a few caveats, due to the fact that what is marketed as "relational" is not Relational.
Get rid of the Record ID fields, they are anti-Relational.
Record IDs reduce your schema to a 1970's style Record Filing system (located in an SQL container for convenience).
Record IDs do not provide row uniqueness, which is demanded by the Relational Model.
Further, they require one additional field and one additional index per file.
When modelling a database (Relational or not), perceive the data, as data, and nothing but data. Do not view the data in terms of your need re the GUI, or some query or other.
It is an error to concern yourself with performance issues at this (modelling) stage. First get it right. Second, make it fast. Do not reverse the prescribed sequence.
Relational Keys provide meaning, as well as Relational Integrity (which is Logical, and distinct from Referential Integrity, which is a physical facility of SQL). What this addresses is the context in which an object exists.
A Sensor does not exist in isolation (except when it is in a package on a shelf in a shop ... but even then, it exists in the context of the shop inventory)
An active Sensor exists only in the context of the object in which it is housed. You have not provided any info regarding that. Let's call the thing Article as a generic label.
Further, it is the Article that requires a limit on the Metric that is being measured by the Sensor (for the purpose of out-of-range alarms, etc), and not the Sensor itself. (The Sensor may have a range, which is a different thing.)
Likewise, a Sensor exists in a Location, which is a second vector. Or else, the Article exists in a Location, and the Article Key carries the Location. I have modelled the latter.
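The caveat about Record IDs and row uniqueness can be checked with a small experiment. The following is only a sketch: it uses Python's built-in sqlite3 module for convenience, and the table and column names (reading_with_id, reading_keyed, reading_value) are made up for the illustration and are not taken from the data model below. It compares a table whose only key is a Record ID with one keyed on (SensorSerialNo, DateTime).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table keyed only by a Record ID.
    CREATE TABLE reading_with_id (
        id               INTEGER PRIMARY KEY AUTOINCREMENT,
        sensor_serial_no TEXT NOT NULL,
        date_time        TEXT NOT NULL,
        reading_value    REAL NOT NULL
    );
    -- Hypothetical table keyed by the data itself; WITHOUT ROWID makes
    -- SQLite store the rows in primary-key order.
    CREATE TABLE reading_keyed (
        sensor_serial_no TEXT NOT NULL,
        date_time        TEXT NOT NULL,
        reading_value    REAL NOT NULL,
        PRIMARY KEY (sensor_serial_no, date_time)
    ) WITHOUT ROWID;
""")

row = ("SN-001", "2024-01-01 10:00:00", 21.5)

# The Record ID table accepts the same reading twice without complaint.
for _ in range(2):
    conn.execute(
        "INSERT INTO reading_with_id (sensor_serial_no, date_time, reading_value) "
        "VALUES (?, ?, ?)",
        row,
    )
rows_with_id = conn.execute("SELECT COUNT(*) FROM reading_with_id").fetchone()[0]

# The table keyed on the data rejects the duplicate.
conn.execute("INSERT INTO reading_keyed VALUES (?, ?, ?)", row)
try:
    conn.execute("INSERT INTO reading_keyed VALUES (?, ?, ?)", row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The Record ID table ends up with two identical readings, while the keyed table refuses the second insert, and it does so without the extra column and extra index.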
3.2 Data Model
Here is the solution:
Sensor Data Model
Inline graphics may not show up in some browsers. In that case, here it is in PDF.
It will satisfy both OLTP and OLAP (Dimension-Fact) requirements.
If you provide more context, we can get that modelled precisely. This may take a bit of to-and-fro.
It is limited to the info provided.
I have taken MetricType and SensorType to be synonymous.
Article is shown as Dependent on (exists within) Location, alternately they could be separate vectors. In any case, Article and Location together qualify Sensor.
Since SensorSerialNo is unique (AK2), therefore Reading(SensorSerialNo, DateTime) is unique. An index is not required. However, in the event there are many queries on Reading via SensorSerialNo alone, such an index will boost performance.
Please feel free to ask questions, and I will answer.
For those who are completely new to IDEF1X, refer to IDEF1X Introduction.
For those who are familiar with IDEF1X, and only want a brush-up, refer to IDEF1X Anatomy.
4 Performance
Your concern re performance is good, but far too premature to be applied at this stage. First get the data model right, second get the data structures fast. The reasons for that are many, not the least of which is, when the data is Normalised, Relationally, the structures are already very fast. Further, one should never optimise for a particular query (one can add indices, if necessary, in the second stage).
Nevertheless, I will respond to your stated concerns.
Eg. a ClusteredIndex on the prescribed Reading PK will:
Serve most queries, most Dimensions (except queries that use SensorSerialNo alone, in case of which I have suggested an additional index)
Serve all OLTP Transactions and ensure the highest concurrency, because the Sensors are distributed per the real world: across Locations and Articles.
Whereas an Index on a Record ID guarantees a HotSpot on every single INSERT. Great for creating Deadlocks.
4.1 Benchmark
I do have a hundred or so benchmarks for data structures such as this, collected over the last four decades for both OLTP & OLAP use. Most of my customers are banks (Think: Sensor Readings are very much like Stock Prices that change over the period of a day; several vectors (Dimensions); billions of rows). Banks are very strict about confidentiality, so I cannot publish the benchmarks as is, and redacting them will take time and effort.
I do have one benchmark for a very similar requirement, that is public. In fact, it was included in an Answer to a SO Question re Time Series data, but the seeker got the moderators to excise it (it is embarrassing to Oracle). Here is the Benchmark Summary for the Sybase ASE vs Oracle 10.2 benchmark on a fixed DDL (Time Series data) and population.
Finally, the structures and code required are simple enough for you to run your own benchmark.
5 Response to Other Answers
Re Neville's comments:
However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least 1 self-join.
Noting that his comments regard EAV, but that it may imply that it applies equally to the subject table (an ordinary Relational database table (non-EAV) Reading) in this case, because it concerns the query type (and not the EAV concept vs the Relational concept):
The declaration does not apply to Relational tables (it may well apply to EAV; the masses of problems introduced due to Record IDs; etc)
As long as you have
a genuine Relational database schema (as I have suggested), and
a genuine SQL platform (not a pretend "sql", which does not comply but fraudulently uses the name), and
you understand IN and NOT IN, and how to compare Sets in SQL
... such queries are straight-forward to code.
6 Response to Comments
Record ID is Anti-Relational
Do you have any links on the record_id being anti-relational, I don't disbelieve you for a second but I'm interested to learn more about why this anti-pattern is so prevalent.
In this mess of anti-science, the academics manufacture and contrive various "solutions" to "problems", that do not exist in the Relational Model, and then you have a second level of endless "debates" about which correction to the non-problem is better or worse.
You don't need links because there is nothing to "debate", and whatever "debate" you might happen to read misses the above point.
The one and only authority is the great Dr E F Codd. All the authors of all books and textbooks alleging to be about the Relational Model, other than Codd, are actually false, they are about implementing 1970's style Record Filing Systems, and anti-Relational (no Relational Power; no Relational Integrity; no Relational Speed). They made the mistake, from 1970, of trying to fit the RM into their 1970's RFS mindset, rather than releasing it and taking on the RM mindset. And they have spent the last FIVE DECADES reinforcing that, even justifying it with "mathematical definitions"; 17 "relational algebras"; 42 abnormal "normal forms". All completely anti-Relational. And they cite each other, so they get published.
The second problem is, sites such as SO are predicated on the basis of populism. The popular answer is not the best or correct answer. For that you need an Authority (very scary to populists), and objective, absolute truth. (People love their relative or subjective "truths", that change all the time).
Therefore, you need just the single, authoritative definition, the original paper, the Relational model.
Yes, the terms are out-dated, and not well understood these days.
Yes, it is seminal (every word counts, has deep meaning).
No, you need not read section 2 (math).
You need to glean from that, that:
the Relational Key is “made up from the data” (my paraphrase, to the several entries, which are layers in the RM), which is Logical
that surrogates are (a) not only against that definition, (b) they are the pre-Relational paradigm, that is Physical pointers, the very thing the RM replaces, and (c) explicitly prohibited.
Very important, you need to understand not only the definition of the Relational Key, but the whys and the wherefores.
Eg. that it transcends import/export problems that pointer-based systems have.
Eg. the temporal definition (seminal; 8 letters; scary).
Therefore, there is no argument, no "debate", to be had.
Anyone going against that is anti-Relational. Not because I say so, but because it contradicts evidenced facts, and the single Authority.
I have named the explicit technical benefits of using the RM correctly (Relational Power; Relational Integrity; Relational Speed), but an expansion of that requires a fair amount of effort.
The consequence of NOT complying with the RM is, you get (a) none of the benefits, AND (b) you get the complete set of problems that pre-Relational Record Filing Systems had in 1970, AND (c) the contrived "solutions" supplied by the "academics" that have never worked.
If you need an expansion of those benefits of the RM, which of course you do need to understand to some degree, because each one is very deep and very important, the best I can provide is this. As you can imagine, this is a battle that I have to fight on every Answer that relates to this subject, so I have posted a fair amount, over the years, across many Answers.
Go to my profile, select All Answers, and read any that relate to this subject.
Why is this Record ID anti-pattern so prevalent ?
The short answer is, people love their ignorance, their subjective "truths", and will fight tooth and nail to protect it. They quickly accept and repeat any justification for remaining the same. Learning something that is a paradigm shift away from what they know, is very scary, because it threatens their comfortable ignorance, and exposes it for what it really is. They will have to admit that what they have been writing for FIVE DECADES is wrong. That is why populism thrives. In ignorance.
The slightly longer answer is this. Just look at the internet. In the old days, for any particular subject, we had one source, one absolute authority: eg. buy the Encyclopædia Britannica; spend your entire childhood devouring it. Permanent truth. Honest history. But now anyone with a keyboard and two fingers plus some connective tissue (no brain required) can post. As an instant "authority". The web is chock-full of (a) superficial answers (the anti-thesis of "Now THAT is an answer") (b) in many flavours (c) that get upvoted due to populism (d) that are nowhere near the correct or full answer. Sound bites that can be easily understood by the populace. Very few want the depth of the full answer.
Even when an authority of sorts becomes established (eg. Wikipedia; Stack Overflow), it is easily subverted, because there are literally millions of people who change the entries (truth does not change, therefore, as long as something is changing, it is not truth). Mostly to serve their political positions; their ideologies; their re-writes of history to make the past wrong (it wasn't, it already happened), and the present insanity "good".
The definitive answer is this: academic envy. It took a whole decade for Codd's Relational Model to be understood and accepted. And even then, only by the few. IBM, and Britton-Lee (which became Sybase) implemented Codd's RM, in spirit and word. (Digital Equipment Corp did as well, but they are defunct.) Those academics who appeared to be working with Codd turned out to be actually working against him (by virtue of the evidence). They hated the fact that they did not come up with it themselves, that one man came up with the first real model, with a sound; logical; mathematical, foundation, complete with a Relational Algebra. All integrated. All requirements of the day (eg. the Bill Of Materials problem) answered. That has stood the test of time: five decades and nothing has been added or changed.
Typically they will declare, "but Codd did not define this or that, so here I am defining it ...". So they came up with their own RA. Now they have 17, all irrelevant. And abnormal "normal forms" to elevate fragmented bits of their Record Filing Systems to seem "relational". Now they have 42, all irrelevant. And many books, alleging to be "relational", but by evidenced fact, anti-Relational. Each "academic" seeks to reinforce their "academic" position, against all others.
Which is why I say, again, go to the one and only Authority. Read nothing from the anti-Relational crowd, because it will diminish your understanding of the RM (at best), or poison your mind (at worst).
One Clarification
If you examine a Relation PK (eg) Location.Location, it may seem odd. This is a %Code or %ShortName that is data, that the user actually uses. Usually 4 to 6 characters, max 12. As distinct from the long Name, which has to exist, and which is an Alternate Key. And of course, it is definitely not a number of any kind (which is not data, not something that the user uses). Users too, like their short forms. Obviously, use any International Standard if such exists.
The Key must be stable (not static, nothing in the universe is static), and one that is used in the real world to uniquely identify the object (data row).
Eg. for Security, which is a company listed on the stock exchange, in America, it would be TickerSymbol, in Australia ASXCode. The ISO code, an ISINCode, is an AlternateKey.
For cities, use one of the geographic location standards: ISO; FIPS; etc. (I use Statoids because it existed long before the others, but those days are numbered). At worst, use Airport Code.
Genuine SQL Platform
What do you consider to be genuine SQL? Sql Server, Postgres, MySQL, Oracle I guess all would be?
No. I mean any platform that actually complies with the published SQL Standard, and therefore can actually support relational tables; relational processing of Sets; and ACID Transactions.
That automatically excludes freeware/vapourware/nowhere/"open source", for which bits are written by 10,000 developers spread across the universe, with no governing principles. Eg. no ACID Transactions, or the structures that are required for it, which are required in every code segment. Too late to insert that now, because it will require a 100% re-write, and heaven forbid ... a Server Architecture.
Commercial
which means paid-for and supported, is also important. Either you have a maintenance contract and support is immediate, or you post a bug report and you check for updates every day for the next year or three.
Server Architecture
If either scalability or performance (high throughput; high concurrency; low latency) is required, then the Server Architecture is most important. Again, that excludes the freeware, and Oracle, because they have no Architecture, they are massive collections of interacting programs, that get the o/s to perform all the functions that an architected Database Server would normally perform.
Check this Comparison of Oracle vs Sybase Architecture.
The exact same applies to PostgreSQL and other freeware. PostgreSQL (son of the total failure Ingres) famously failed under pressure, with masses of locking problems and very low concurrency.
1 High-End, Commercial, SQL Compliant
Something like 5% market share, but 95% of the Financial Services and Automation markets. Great Architecture, hopeless marketing.
Sybase ASE
IBM DB2
2 Commercial, SQL Compliant
MS SQL Server
Easily the most common. Good Architecture (originally stolen from Sybase) and then "progressed" in the usual insane MS style. Pain to use; masses of overhead; poorly integrated with various add-ons and must-uses.
3 Commercial, SQL Non-Compliant
Hopeless Architecture, great marketing.
Oracle
Generally, Oracle developers are quite good at using the product in the ways that are required to get it to work, but that means they have strayed quite far away from the Relational Model.
Eg. in the Time Series benchmark, the whole point was, Oracle cacks itself when a Subquery is requested, so it has to use an "Inline View". Which the OP alleged was just as fast as a Subquery (avoiding the fact that it requires far more code, and the coder must step outside the Relational mindset). Which the benchmark proved to be hilariously false, in each scenario tested (Oracle was 3 to 4.8 times slower than Sybase on a COUNT(), 26 to 36 times slower on a SUM(), and the Subquery (Sybase 2.1 secs) had to be abandoned after 120 mins).
Eg. Oracle is non-compliant re ACID Transactions, and developers work around that obstacle to a degree, but Phantom Updates and Lost Updates (technical terms) are simply not prevented. If the work-arounds are not written properly, entire rows (UPDATES or INSERTS) are lost.
All that applies to the below ...
4 Non-Commercial, SQL Non-Compliant
These guys spend an awful lot of time developing "features" that are not required for a Relational database, but very attractive to the anti-Relational Record ID Filing Systems.
Eg. "deferred constraint checking"; ENUMs; etc.
They lack the basics of SQL compliance. Eg. no genuine ACID Transactions.
Further, as explained above, zero Architecture. This results in systems that perform wonderfully under single-use, and fail miserably under any order of pressure from concurrency or scalability.
Due to their non-compliance with the SQL requirement, they take pains to post a notice of compliance on every page in the Commands manual. (Just one declaration of compliance at the front of the manual is all that is required.) Of course, the missing commands are simply missing, so gee whiz, they do not have a compliance declaration.
PostgreSQL
The worst piece of software I have ever had to examine since the days of Ingres. Dearly loved by the "academic" crowd, simply because it was scrawled by a fellow "academic".
5 user max, or deal with the concurrency problems (just take a cursory look at the problems reported on SO).
MySQL
Head and shoulders above PostgreSQL, but still in this category.
The InnoDB engine is distinctly better in the performance department, but nowhere near the Sybase/DB2 level (still no genuine Server Architecture). No respite in the SQL non-compliance department.
5 Summary
You get what you pay for.
Server Architecture, most visibly, performance in every scenario.
SQL Compliance, thought through deeply, and implemented in every applicable code segment.
Last but not least, Support.
Whatever you choose, remember, when you port it to another platform, your SQL code will require a complete check-and-change, because the "flavours" of SQL (or NON-sql) are very different. For the Non-Commercial program suites, that means a complete rewrite. Therefore choose carefully, with the long term implementation in mind.
It depends largely on the types of query you'll need to run. I think performance may not be your biggest concern if, as you say
A summary of all metrics for a given time bucket needs to be obtained
via a query, and this is likely to be the most common querying scenario.
As queries in all scenarios would hit an indexable timestamp column, it really is just a question of the performance of joins, and pretty much every relational database is really good at that.
If your queries really are just "show data for a time range", your option 3 (an entity/attribute/value design) is most effective from a development effort point of view.
Your query would have a single, inner join, and the timestamp column would provide a good index. As you say, you wouldn't need to change schema or queries when collecting new measurement points.
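As a rough sketch of what that looks like, the following uses Python's built-in sqlite3 module. The table names (metric_type, metric), the column names, and the hourly bucket are assumptions for the example, not something taken from the question's diagrams.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metric_type (
        type_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE metric (
        type_id INTEGER NOT NULL REFERENCES metric_type (type_id),
        ts      TEXT    NOT NULL,
        value   REAL    NOT NULL
    );
    -- The timestamp leads the index because every query filters on a time range.
    CREATE INDEX metric_ts_type ON metric (ts, type_id);
""")
conn.executemany(
    "INSERT INTO metric_type VALUES (?, ?)",
    [(1, "temp"), (2, "humidity"), (3, "cpu")],
)
conn.executemany(
    "INSERT INTO metric VALUES (?, ?, ?)",
    [
        (1, "2024-01-01 10:05:00", 20.0),
        (1, "2024-01-01 10:35:00", 22.0),
        (2, "2024-01-01 10:10:00", 55.0),
        (3, "2024-01-01 10:20:00", 40.0),
        (3, "2024-01-01 11:15:00", 10.0),
    ],
)

# A new kind of measurement is one new row, not a schema change.
conn.execute("INSERT INTO metric_type VALUES (4, 'co2')")
conn.execute("INSERT INTO metric VALUES (4, '2024-01-01 10:50:00', 415.0)")


def hourly_summary(conn, start, end):
    """Average of every metric per hour bucket within [start, end)."""
    return conn.execute(
        """
        SELECT strftime('%Y-%m-%d %H:00', m.ts) AS bucket,
               t.name,
               AVG(m.value)
        FROM metric AS m
        JOIN metric_type AS t ON t.type_id = m.type_id
        WHERE m.ts >= ? AND m.ts < ?
        GROUP BY bucket, t.name
        ORDER BY bucket, t.name
        """,
        (start, end),
    ).fetchall()


summary = hourly_summary(conn, "2024-01-01 10:00:00", "2024-01-01 11:00:00")
```

The summary query does not change when the co2 metric is added; the new metric simply shows up as another row per bucket.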
The alternative designs would require outer joins for each table. In performance terms, that's not a huge deal, but managing the schema and associated queries would be a pain.
However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least 1 self-join.
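To show where the self-join comes from, here is a deliberately simplified sketch using Python's sqlite3 module. The table layout and sample values are hypothetical, the readings are assumed to be already aligned to the same hourly timestamp (which the question says is not the case for raw data), and the "for more than 3 hours" part is left out entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metric (metric_name TEXT NOT NULL, ts TEXT NOT NULL, value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO metric VALUES (?, ?, ?)",
    [
        ("cpu",      "2024-01-01 10:00", 45.0),
        ("humidity", "2024-01-01 10:00", 50.0),
        ("cpu",      "2024-01-01 11:00", 20.0),
        ("humidity", "2024-01-01 11:00", 50.0),
        ("cpu",      "2024-01-01 12:00", 60.0),
        ("humidity", "2024-01-01 12:00", 70.0),
    ],
)

# Each criterion needs its own alias of the same table: c for cpu, h for humidity.
rows = conn.execute(
    """
    SELECT c.ts
    FROM metric AS c
    JOIN metric AS h ON h.ts = c.ts
    WHERE c.metric_name = 'cpu'      AND c.value > 30
      AND h.metric_name = 'humidity' AND h.value < 56
    ORDER BY c.ts
    """
).fetchall()
matching_hours = [ts for (ts,) in rows]
```

A third criterion would add a third alias and another join condition, and the duration requirement would add more on top of that, which is how these queries get hard to read.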
TimescaleDB's documentation discusses wide versus narrow data models:
https://docs.timescale.com/timescaledb/latest/overview/data-model-flexibility/
In summary:
"A narrow model makes sense if you collect each metric independently. It allows you to add new metrics as you go by adding a new tag without requiring a formal schema change."
"If you typically query multiple metrics together, it is both faster and easier to store them in a wide table format"
Indeed, option 3 is a form of EAV modeling on relational storage, with the timestamp included in the EAV key.
+---------+ +-----+ +-------------+
| Sensors | -- 1:M --< | EAV | >-- M:1 -- | Value kinds |
+---------+ +-----+ +-------------+
A summary of all metrics for a given time bucket needs to be obtained
via a query, and this is likely to be the most common querying scenario
If queries don't require joins but need to be grouped by time, the clustered index on timestamp column ensures the performance.
However, any queries with joins (i.e. comparing values of different sensors) risk degrading the performance. The solution can be a separate OLAP store for the collected EAV data.
From a developer's point of view, I would like to recommend the third option. In your third option, you might consider having indexes on the MetricType (i.e. typeId) and the timestamp column, which will greatly optimize the query performance.
Your first option, by contrast, requires system downtime: when a new column needs to be added, you have to shut down the live system, add the column, initialize it with default or null values, and then bring the system back up. In my opinion, it will also hold unnecessary data (garbage) in all the rows that existed before the column was added. The table will be huge and may contain a significant amount of garbage data, which affects query time.
The second idea is an improvement over the first. However, although it avoids the garbage data, it requires joining multiple tables, which will increase query time as the data grows. You also cannot have one index spanning multiple tables, as you could for the third option.
Hence I think going for the third option is the most effective. The tables are normalized and effective indexes will provide efficient query results.
I would like to suggest another thing. You might also consider having a separate table that contains aggregated data. For example, if your system requires aggregated data, you could store the aggregated values for a certain time period in a denormalized style in a separate table, so that you can remove the already processed data from your original table. This is the OLAP approach, which you might consider looking into.
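A minimal sketch of such a roll-up, using Python's sqlite3 module, could look like this. The table names (metric, metric_hourly), the hourly granularity, and the chosen aggregates are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metric (
        type_id INTEGER NOT NULL,
        ts      TEXT    NOT NULL,
        value   REAL    NOT NULL
    );
    -- Hypothetical aggregate table: one row per metric type and hour.
    CREATE TABLE metric_hourly (
        type_id   INTEGER NOT NULL,
        bucket    TEXT    NOT NULL,
        n         INTEGER NOT NULL,
        avg_value REAL    NOT NULL,
        min_value REAL    NOT NULL,
        max_value REAL    NOT NULL,
        PRIMARY KEY (type_id, bucket)
    );
""")
conn.executemany(
    "INSERT INTO metric VALUES (?, ?, ?)",
    [
        (1, "2024-01-01 10:05:00", 20.0),
        (1, "2024-01-01 10:35:00", 22.0),
        (1, "2024-01-01 11:10:00", 30.0),
    ],
)


def roll_up(conn, cutoff):
    """Aggregate raw rows older than cutoff into hourly rows, then delete them.

    cutoff should fall on an hour boundary so that no bucket is split.
    Both statements run in one transaction.
    """
    with conn:
        conn.execute(
            """
            INSERT INTO metric_hourly
            SELECT type_id,
                   strftime('%Y-%m-%d %H:00', ts),
                   COUNT(*), AVG(value), MIN(value), MAX(value)
            FROM metric
            WHERE ts < ?
            GROUP BY type_id, strftime('%Y-%m-%d %H:00', ts)
            """,
            (cutoff,),
        )
        conn.execute("DELETE FROM metric WHERE ts < ?", (cutoff,))


roll_up(conn, "2024-01-01 11:00:00")
hourly_rows = conn.execute("SELECT * FROM metric_hourly").fetchall()
raw_rows_left = conn.execute("SELECT COUNT(*) FROM metric").fetchone()[0]
```

After the roll-up, the two readings from the 10:00 hour are represented by a single aggregate row, and only the newer raw reading remains in the original table.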
I wouldn't recommend an ERD design where you need to alter a table whenever you add a sensor (as long as you know you will). That's why I believe you should eliminate option 1. Whenever you alter your table, you get plenty of null values and extra work in your code.
The same applies to option 2, maybe except for the nulls, but you will still get unnecessary work whenever you add a new data source to your system.
Option 3 looks like a good fit to me, as it is ready for new data sources and keeps the data clean and neat.

Reaching an appropriate balance between performance and scalability in a large database

I'm trying to determine which of the many database models would best support probabilistic record comparison. Specifically, I have approximately 20 million documents defined by a variety of attributes (name, type, author, owner, etc.). Text attributes dominate the data set, yet there are still plenty of images. Read operations are the most crucial vis-a-vis performance, but I expect roughly 20,000 new documents to insert each week. Luckily, insert speed does not matter at all, and I am comfortable queuing the incoming documents for controlled processing.
Database queries will most typically take the following forms:
Find documents containing at least five sentences that reference someone who's a member of the military
Predict whether User A will comment on a specific document written by User B, given User A's entire comment history
Predict an author for Document X by comparing vocabulary, word ordering, sentence structure, and concept flow
My first thought was to use a simple document store like MongoDB, since each document does not necessarily contain the same data. However, complex queries effectively degrade this to a file system wrapper, since I cannot construct a query yielding the results I desire. As such, this approach corners me into walking the entire database and processing each file separately. Although document stores scale well horizontally, the benefits are not realized here.
This led me to realize that my granularity isn't at the document level, but rather the entity-relationship level. As such, graph databases seemed like a logical choice, since they facilitate relating each word in a sentence to the next word, next paragraph, current paragraph, part of speech, etc. Graph databases limit data replication, increase the speed of statistical clustering, and scale horizontally, among other things. Unfortunately, ensuring a definitive answer to your query still necessitates traversing the entire graph. Even so, indexing will help with performance.
I've also evaluated the use of relational databases, which are very efficient when designed properly (i.e., by avoiding unnecessary normalization). A relational database excels at finding all documents authored by User A, but fails at structural comparisons (which involves expensive joins). Relational databases also enforce constraints (primary keys, foreign keys, uniqueness, etc.) efficiently--a task with which some NoSQL solutions struggle.
After considering the above-listed requirements, are there any database models that combine the "exactness" of relational models (viz., efficient exhaustion of the domain) with the flexibility of graph databases?
This is not really an answer, just a discussion.
The database you are talking about is a large database. You don't mention the nature of the documents, but newspaper articles are typically in the 2-3k range, so you are talking about hundreds of gigabytes of raw data.
If query performance is an issue, you are talking about a large, rather expensive system.
Your requirements are also quite complex, and not likely to be out-of-the-box. I would be thinking of a hybrid system. Store the document metadata in a relational database system, so you can quickly access them with simple queries. You can store the documents themselves in the database as blobs.
Some of your requirements can be met with text add-ins on relational databases. So, simple searching is feasible using inverted index technology. That handles the first of your three scenarios.
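To make "inverted index" concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not how any particular database's full-text add-in works; the function names (`tokenize`, `build_inverted_index`, `search`) and the sample documents are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample data, purely for illustration.
docs = {
    1: "The colonel addressed the army recruits.",
    2: "A report on local gardening clubs.",
    3: "The army and the navy held a joint exercise.",
}
index = build_inverted_index(docs)
print(search(index, "army"))       # ids of documents mentioning "army"
print(search(index, "army navy"))  # ids of documents mentioning both terms
```

A lookup touches only the posting sets for the query terms, so it does not need to walk every document. Recognising that a sentence refers to a member of the military would still need extra processing at load time.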
The other two are much more challenging. The third ("predict an author") can probably be handled by having a parallel system that stores author information, summarized from the documents when they are loaded. Then it is a question of comparing a document to the author, using simple statistical analysis (naive Bayesian, anyone?).
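As a rough sketch of what that could look like: the class below keeps per-author word counts (the "summarized author information") and scores a new document with a multinomial naive Bayes model with add-one smoothing. The class name `AuthorModel`, the whitespace tokenization, and the toy texts are all assumptions for the example; a real system would use far richer features than raw word counts.

```python
import math
from collections import Counter, defaultdict

class AuthorModel:
    """Per-author word statistics, updated as documents are loaded."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # author -> word -> count
        self.doc_counts = Counter()              # author -> number of documents
        self.vocab = set()                       # all words seen so far

    def add_document(self, author, text):
        words = text.lower().split()
        self.word_counts[author].update(words)
        self.doc_counts[author] += 1
        self.vocab.update(words)

    def predict(self, text):
        """Return the author with the highest naive Bayes log-score."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_author, best_score = None, float("-inf")
        for author, counts in self.word_counts.items():
            # Log prior from the share of documents by this author.
            score = math.log(self.doc_counts[author] / total_docs)
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            denom = sum(counts.values()) + len(self.vocab)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author

# Toy training data, invented for the example.
model = AuthorModel()
model.add_document("alice", "the quantum field equations converge slowly")
model.add_document("alice", "quantum states and field theory")
model.add_document("bob", "the recipe needs butter flour and sugar")
model.add_document("bob", "bake the bread with flour")
print(model.predict("quantum field notes"))
print(model.predict("butter and flour bread"))
```

The point of the design is that the per-author summary is small and is updated once per insert, so prediction never has to re-read the document store.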
The middle one is tricky, but it suggests yet another component for managing comments on documents. Depending on the volume, this may be easy or hard.
Finally, how changing are the requirements? Do you really know what the system should be doing? Or will the functionality be radically different once you get it up and running?

How does database query time scale with database size?

I was on the OEIS (Online Encyclopedia of Integer Sequences) recently, trying to look up a particular sequence I had on hand.
Now, this database is fairly large. The website states that if the 2006 (! 5 years old) edition were printed, it would occupy 750 volumes of text.
I'm sure this is the same sort of issue Google has to handle as well. But, they also have a distributed system where they take advantage of load balancing.
Neglecting load balancing however, how much time does it take to do a query compared to database size?
Or in other words, what is the time complexity of a query with respect to DB size?
Edit: To make things more specific, assume the input query is simply looking up a string of numbers such as:
1, 4, 9, 16, 25, 36, 49
It strongly depends on the query, structure of the database, contention, and so on. But in general most databases will find a way to use an index, and that index will either be some kind of tree structure (see http://en.wikipedia.org/wiki/B-tree for one option), in which case access time is proportional to log(n), or else a hash, in which case access time is O(1) on average (see http://en.wikipedia.org/wiki/Hash_function#Hash_tables for an explanation of how they work).
So the answer is typically O(1) or O(log(n)) depending on which type of data structure is used.
This may cause you to wonder why we don't always use hash functions. There are multiple reasons. Hash functions make it hard to retrieve ranges of values. If the hash function fails to distribute data well, it is possible for access time to become O(n). Hashes need resizing occasionally, which is potentially very expensive. And log(n) grows slowly enough that you can treat it as being reasonably close to constant across all practical data sets. (From 1000 to 1 petabyte it varies by a factor of 5.) And frequently the actively requested data shows some sort of locality, which trees do a better job of keeping in RAM. As a result trees are somewhat more commonly seen in practice. (Though hashes are by no means rare.)
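A small Python sketch of the two points above: how slowly log(n) grows, and why ordered structures handle ranges better than hashes. A sorted list searched with `bisect` stands in for a tree index here, and a `dict` stands in for a hash index; neither is how a real database lays out its indexes, and the helper names are made up for the example.

```python
import math
from bisect import bisect_left

def comparisons_for_binary_search(n):
    """Rough upper bound on key comparisons for a lookup among n sorted keys."""
    return math.ceil(math.log2(n)) if n > 1 else 1

# log(n) from a thousand items to 10**15 (a "peta" of items).
ratio = math.log(10**15) / math.log(10**3)

keys = list(range(0, 1000, 2))              # sorted keys: stand-in for a tree index
hash_index = {k: f"row-{k}" for k in keys}  # stand-in for a hash index

def tree_like_lookup(sorted_keys, key):
    """Binary search: O(log n) comparisons."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

def range_scan(sorted_keys, lo, hi):
    """Keys in [lo, hi): two binary searches plus a slice on ordered data."""
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_left(sorted_keys, hi)]

print(comparisons_for_binary_search(10**3))   # about 10 comparisons
print(comparisons_for_binary_search(10**15))  # about 50 comparisons
print(round(ratio, 6))
print(range_scan(keys, 10, 17))
print(10 in hash_index)  # O(1) on average, but the dict has no notion of "next key"
```

Answering the same range question from the hash alone would mean probing every candidate key or scanning all entries, which is the "hard to retrieve ranges" problem mentioned above.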
That depends on a number of factors including the database engine implementation, indexing strategy, specifics of the query, available hardware, database configuration, etc.
There is no way to answer such a general question.
A properly designed and implemented database with terabytes of data may actually outperform a badly designed little database (particularly one with no indexing and one that uses badly performing non-sargable queries and things such as correlated subqueries). This is why anyone expecting to have large amounts of data needs to hire an expert on database design for large databases to do the initial design, not later when the database is already large. You may also need to invest in the type of equipment you need to handle the size as well.

How does Voldemort compare to Cassandra?

How does Voldemort compare to Cassandra?
I'm not talking about size of community and only want to hear from people who have actually used both.
Especially I'm interested in:
How they dynamically scale when adding and removing nodes
Query performance
How they scale when adding nodes (linear)?
Write speed
Voldemort's support for adding nodes was just added recently (this month). So I would expect Cassandra's to be more robust given the longer time to cook and a larger community testing.
Both are fast (> 10k ops/s per machine). Because of their storage designs, I would expect Cassandra to be faster at writes, and Voldemort to be faster at reads. I would also expect Cassandra's performance to degrade less as the amount of data per node increases. And of course if you need more than just a key/value data model Cassandra's ColumnFamily model wins.
I don't know of any head-to-head benchmarks since the one done for NoSQL SF last June, which found Cassandra to be somewhat faster at whatever workload mix he was using. (The "vpork" talk from http://blog.oskarsson.nu/2009/06/nosql-debrief.html) 8 months is an eternity with projects under this much development, though.
Some additional comments:
Regarding write speed, Cassandra should be faster -- it is designed to be faster to write than read (you can avoid immediate disk hit for writes due to specialized way storage is done)
But the main difference, I think, is actually not performance but feature set: Voldemort is strictly a key/value store (currently anyway), whereas Cassandra can offer range queries (with an order-preserving partitioner), and a bit more structure around data (column families etc). The former is an important consideration for design; the latter IMO less so, since you can always structure BLOB data on the client side.

Database schema for hierarchical groups

I'm working on a database design for groups hierarchy used as the foundation of a larger system. Each group can contain other groups, and also 'devices' as leaf objects (nothing goes below device).
The database being used is MS SQL 2005. (Though working in MS SQL 2000 would be a bonus; a solution requiring MS SQL 2008 is unfortunately not feasible at this time).
There are different types of groups, and these need to be dynamic and definable at run-time by users. For example, group types might be "customer", "account", "city", or "building", "floor", and each type is going to have a different set of attributes, definable by the user. There will also be business rules applied - eg, a "floor" can only be contained underneath a "building" group, and again, these are definable at runtime.
A lot of the application functionality comes from running reports based on these groups, so there needs to be a relatively fast way to get a list of all devices contained within a certain group (and all sub-groups).
Storing groups using modified pre-order tree traversal technique has the upside that it is fast, but the downside that it is fairly complex and fragile - if external users/applications modify the database, there is the potential for complete breakage. We're also implementing an ORM layer, and this method seems to complicate using relations in most ORM libraries.
Using common table expressions and a "standard" id/parentid groups relation seem to be a powerful way to avoid running multiple recursive queries. Is there any downside to this method?
As far as attributes, what is the best way to store them? A long, narrow table that relates back to group? Should a common attribute, like "name" be stored in a groups table, instead of the attributes table (a lot of the time, the name will be all that is required to display)?
Are there going to be performance issues using this method (let's assume a high average of 2000 groups with an average of 6 attributes each, and an average of 10 concurrent users, on a reasonable piece of hardware, e.g., quad-core Xeon 2 GHz, 4GB RAM, discounting any other processes)?
Feel free to suggest a completely different schema than what I've outlined here. I was just trying to illustrate the issues I'm concerned about.
I'd recommend you actually construct the easiest-to-maintain way (the "standard" parent/child setup) and run at least some basic benchmarks on it.
You'd be surprised what a database engine can do with the proper indexing, especially if your dataset can fit into memory.
Assuming 6 attributes per group, 2000 groups, and 30 bytes/attribute, you're talking 360KB*expected items/group -- figure 400KB. If you expect to have 1000 items/group, you're only looking at 400MB of data -- that'll fit in memory without a problem, and databases are fast at joins when all the data is in memory.
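The back-of-the-envelope numbers can be reproduced in a few lines of Python. The constants are the figures from the paragraph above; the function name is just for the example, and the result ignores row and index overhead, which is presumably why the estimate is rounded up.

```python
ATTRS_PER_GROUP = 6
GROUPS = 2000
BYTES_PER_ATTR = 30

# Attribute data for one "item" per group: 6 * 2000 * 30 bytes.
base_bytes = ATTRS_PER_GROUP * GROUPS * BYTES_PER_ATTR

def estimated_bytes(items_per_group):
    """Raw attribute bytes, ignoring per-row and index overhead."""
    return base_bytes * items_per_group

print(base_bytes / 1000, "KB per item/group")              # 360.0 KB, rounded up to 400KB above
print(estimated_bytes(1000) / 1_000_000, "MB at 1000 items/group")  # 360.0 MB, rounded up to 400MB
```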
Common table expressions will let you get out a list of groups with the parent-child relations. Here is an example of a sproc using CTE's for a different application. It's reasonably efficient but beware the following caveats:
If a part occurs more than once in the hierarchy it will be reported at each location. You may need to post-process the results.
CTE's are somewhat obtuse and offer limited scope to filter results within the query - the CTE may not appear more than once within the select statement.
Oracle's CONNECT BY is somewhat more flexible as it doesn't impose nearly as many limitations on the query structure as CTE's do, but if you're using SQL Server this won't be an option.
If you need to do anything clever with the intermediate results then write a sproc that uses the CTE to get a raw query into a temporary table and work on it from there. SELECT INTO will minimise the traffic incurred in this. The resulting table will be in cache so operations on it will be reasonably quick.
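For reference, here is a minimal sketch of the id/parent_id layout with a recursive CTE, run through Python's built-in `sqlite3` module so it is self-contained. The table and column names (`grp`, `device`, `parent_id`, `group_id`) and the sample rows are invented for the example. SQLite spells the construct `WITH RECURSIVE`, while SQL Server 2005 uses plain `WITH`, so treat this as an illustration of the shape of the query rather than T-SQL you can paste in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grp (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    CREATE TABLE device (id INTEGER PRIMARY KEY, group_id INTEGER, name TEXT);
    -- Index on the parent column so each recursion step is an index lookup.
    CREATE INDEX idx_grp_parent ON grp(parent_id);
    INSERT INTO grp VALUES
        (1, NULL, 'customer'), (2, 1, 'building'),
        (3, 2, 'floor 1'), (4, 2, 'floor 2'), (5, NULL, 'other customer');
    INSERT INTO device VALUES
        (10, 3, 'sensor A'), (11, 4, 'sensor B'),
        (12, 5, 'sensor C'), (13, 2, 'lobby cam');
""")

def devices_under(conn, group_id):
    """Names of all devices in the given group and all of its sub-groups."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM grp WHERE id = ?          -- anchor: the starting group
            UNION ALL
            SELECT g.id FROM grp g
            JOIN subtree s ON g.parent_id = s.id     -- step: children of rows found so far
        )
        SELECT d.name FROM device d
        JOIN subtree s ON d.group_id = s.id
        ORDER BY d.name
    """, (group_id,)).fetchall()
    return [name for (name,) in rows]

print(devices_under(conn, 1))  # devices anywhere under 'customer'
print(devices_under(conn, 3))  # devices on 'floor 1' only
```

Because each group here has exactly one parent, no device is reported twice. If a group could hang under several parents, the duplicate-reporting caveat above would apply.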
Some possible physical optimisations that could help:
Clustered indexes on the parent so that getting out child nodes for a parent uses less I/O.
Lots of RAM and (depending on the size of your BOM table) 64-bit servers with even more RAM so that the main BOM table can be cached in core. On a 32 bit O/S the /3G boot switch is your friend and has no real downside for a database server.
DBCC PINTABLE can help force the database manager to hold the table in cache.
Parent-Attribute Type-Attribute coding tables will not play nicely with CTE's as you will wind up with a combinatorial explosion in your row counts if you include the attribute table. This would preclude any business logic in the query that filtered on attributes.
You would be much better off storing the attributes directly on the BOM table entry.
Pre-order Tree Traversal is very handy. You can make it robust by keeping the traversal numbers up to date with triggers.
A similar technique which I have used is to keep a separate table of (ancestor_id, descendant_id) which lists all ancestors and descendants. This is nearly as good as pre-order traversal numbers.
Using a separate table is handy, because even though it introduces an extra join, it does move the complexity out into a separate table.
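A minimal sketch of the (ancestor_id, descendant_id) table, again using Python's built-in `sqlite3` so it runs as-is. The table name `closure`, the helper names, and the sample hierarchy are assumptions for the example. The sketch also stores each node paired with itself, which is a common convention rather than something required by the approach, and it rebuilds the pairs in application code where a real system might maintain them with triggers.

```python
import sqlite3

def build_closure(parent_of):
    """From {node: parent or None}, produce (ancestor, descendant) pairs.

    Each node is also paired with itself (an assumption of this sketch).
    """
    pairs = []
    for node in parent_of:
        current = node
        while current is not None:
            pairs.append((current, node))
            current = parent_of[current]
    return pairs

# Hypothetical hierarchy: 1 is the root, 2 is under 1, 3 and 4 are under 2.
parent_of = {1: None, 2: 1, 3: 2, 4: 2}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE closure (
        ancestor_id INTEGER,
        descendant_id INTEGER,
        PRIMARY KEY (ancestor_id, descendant_id)
    )
""")
conn.executemany("INSERT INTO closure VALUES (?, ?)", build_closure(parent_of))

def descendants(conn, group_id):
    """All groups below group_id, found with one indexed lookup and no recursion."""
    rows = conn.execute(
        "SELECT descendant_id FROM closure "
        "WHERE ancestor_id = ? AND descendant_id != ? "
        "ORDER BY descendant_id",
        (group_id, group_id),
    ).fetchall()
    return [d for (d,) in rows]

print(descendants(conn, 1))
print(descendants(conn, 2))
```

The trade-off is extra rows (one per ancestor/descendant pair) in exchange for subtree queries that are a single join.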
The modified pre-order is, essentially, Joe Celko's Nested Sets method. His book, "Trees and Hierarchies..." covers both adjacency list and NS, with descriptions of advantages and disadvantages of each. With proper indexing, CTE of adjacency lists gets the most balanced performance. If you're going for read-mostly, then NS will be faster.
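For readers who have not seen nested sets, here is a small Python sketch of the numbering. It is an illustration written for this thread, not code from Celko's book; the function names and the sample tree are made up. Every node gets a left and right number from a depth-first walk, and a node's descendants are exactly the nodes whose numbers fall strictly inside its own.

```python
def number_tree(children, root):
    """Assign (lft, rgt) numbers to every node via a depth-first walk."""
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        lft = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        bounds[node] = (lft, counter)
        counter += 1

    visit(root)
    return bounds

def descendants(bounds, node):
    """Nodes whose (lft, rgt) interval lies strictly inside the given node's."""
    lft, rgt = bounds[node]
    return sorted(n for n, (l, r) in bounds.items() if lft < l and r < rgt)

# Hypothetical hierarchy: 1 is the root, 2 is under 1, 3 and 4 are under 2.
children = {1: [2], 2: [3, 4]}
bounds = number_tree(children, 1)
print(bounds)
print(descendants(bounds, 2))
```

In SQL the same test becomes a simple range predicate on two indexed columns, which is why reads are fast. The cost is that an insert or move has to renumber part of the tree, which is the fragility the question worries about.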
What you seem to be describing is a Bill of Material processor. While not M$, Graeme Birchall has a free DB2 book, with a chapter on hierarchy processing using CTE (the syntax is virtually identical, IIRC, in that the ANSI syntax adopted DB2's, which M$ then adopted): http://mysite.verizon.net/Graeme_Birchall/cookbook/DB2V95CK.PDF

Resources