What is being distributed in a distributed database?

Processing logic : processing logic or processing elements are distributed
Data : used by a number of applications may be distributed to a number of processing sites
Control : The control of the execution of various tasks might be distributed instead of being performed by one computer system.
Can you please explain these three parts more briefly?

Processing logic : processing logic or processing elements are distributed
This lets you optimise the query itself when you know that the data it retrieves is 'scattered' across different places (usually across the network).
Consider a situation where you want to, say, get all the employees from a DB, but the DB has been broken down horizontally into two fragments and the employees you want exist in only one of them. If you don't distribute the processing logic, you have to put the two fragments together with a union and then use only half of the data. The cost of transferring the other half, which isn't required, is wasted work that shows up as longer response time, overall wait time, etc.
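The fragment example above can be sketched in a few lines. This is only an illustration, not how any particular distributed DBMS does it: the site names, the split on a "department" column, and the `employees_in` helper are all made up for the example.

```python
# Hypothetical layout: each site stores one horizontal fragment of the
# employees table, split on the "department" column.
FRAGMENTS = {
    "site_a": {"department": "Sales",
               "rows": [(1, "Ann", "Sales"), (3, "Cy", "Sales")]},
    "site_b": {"department": "IT",
               "rows": [(2, "Bob", "IT")]},
}


def employees_in(department, fragments=FRAGMENTS):
    """Return (rows, sites_contacted) for one department."""
    rows, contacted = [], []
    for site, fragment in fragments.items():
        if fragment["department"] != department:
            # This fragment cannot hold matching rows, so nothing is
            # transferred from this site.
            continue
        contacted.append(site)
        rows.extend(fragment["rows"])
    return rows, contacted


rows, sites = employees_in("IT")
print(rows, sites)
```

Only the site whose fragment can contain matching rows is contacted, which is the saving described above.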
Data : used by a number of applications may be distributed to a number of processing sites
We mentioned fragments just before; a fragment is really just a formal way of defining how the data should be 'broken down'.
Usually, the fragments will be either horizontal fragments or vertical fragments.
A fragment should have a property known as the correctness property. The correctness property demands that three conditions hold for any set of fragments. A somewhat 'simplified' interpretation of these conditions is:
When you put the fragments back together, you get the original table.
Every record from the original table should be present in some fragment, otherwise data will be lost.
Each record only shows up in one fragment.
A trashy analogy: you have a piece of paper, you tear it up out of anger, and then you suddenly realise you had something important written on it, so you want to put it back into its original state. If none of the pieces overlap, everything that was written is still on those pieces, and you have all of the pieces in front of you, then you can reconstruct the original.
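The three conditions can be checked mechanically. Here is a minimal sketch for horizontal fragments, assuming rows are unique tuples; the `check_fragments` name and the sample data are invented for illustration.

```python
def check_fragments(original, fragments):
    """Check the three simplified conditions for horizontal fragments.

    `original` is a list of unique row tuples and `fragments` is a list
    of lists of row tuples.
    """
    all_rows = [row for fragment in fragments for row in fragment]
    return {
        # Every original record appears in at least one fragment.
        "completeness": set(original) <= set(all_rows),
        # The union of the fragments gives back exactly the original table.
        "reconstruction": set(all_rows) == set(original),
        # No record appears more than once across the fragments.
        "disjointness": len(all_rows) == len(set(all_rows)),
    }


employees = [(1, "Ann", "Sales"), (2, "Bob", "IT"), (3, "Cy", "Sales")]
sales = [row for row in employees if row[2] == "Sales"]
others = [row for row in employees if row[2] != "Sales"]
print(check_fragments(employees, [sales, others]))
```

Splitting on a predicate and its negation, as above, passes all three checks by construction.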
Control : The control of the execution of various tasks might be distributed instead of being performed by one computer system.
I think this mostly ties into the performance aspect of a DDB and some aspects of access control. So instead of running queries in one place, the execution of the work is coordinated across several sites.


Reaching an appropriate balance between performance and scalability in a large database

I'm trying to determine which of the many database models would best support probabilistic record comparison. Specifically, I have approximately 20 million documents defined by a variety of attributes (name, type, author, owner, etc.). Text attributes dominate the data set, yet there are still plenty of images. Read operations are the most crucial vis-a-vis performance, but I expect roughly 20,000 new documents to insert each week. Luckily, insert speed does not matter at all, and I am comfortable queuing the incoming documents for controlled processing.
Database queries will most typically take the following forms:
Find documents containing at least five sentences that reference someone who's a member of the military
Predict whether User A will comment on a specific document written by User B, given User A's entire comment history
Predict an author for Document X by comparing vocabulary, word ordering, sentence structure, and concept flow
My first thought was to use a simple document store like MongoDB, since each document does not necessarily contain the same data. However, complex queries effectively degrade this to a file system wrapper, since I cannot construct a query yielding the results I desire. As such, this approach corners me into walking the entire database and processing each file separately. Although document stores scale well horizontally, the benefits are not realized here.
This led me to realize that my granularity isn't at the document level, but rather the entity-relationship level. As such, graph databases seemed like a logical choice, since they facilitate relating each word in a sentence to the next word, next paragraph, current paragraph, part of speech, etc. Graph databases limit data replication, increase the speed of statistical clustering, and scale horizontally, among other things. Unfortunately, ensuring a definitive answer to your query still necessitates traversing the entire graph. Even so, indexing will help with performance.
I've also evaluated the use of relational databases, which are very efficient when designed properly (i.e., by avoiding unnecessary normalization). A relational database excels at finding all documents authored by User A, but fails at structural comparisons (which involves expensive joins). Relational databases also enforce constraints (primary keys, foreign keys, uniqueness, etc.) efficiently--a task with which some NoSQL solutions struggle.
After considering the above-listed requirements, are there any database models that combine the "exactness" of relational models (viz., efficient exhaustion of the domain) with the flexibility of graph databases?
This is not really an answer, just a discussion.
The database you are talking about is a large database. You don't mention the nature of the documents, but newspaper articles are typically in the 2-3k range, so you are talking about hundreds of gigabytes of raw data.
If query performance is an issue, you are talking about a large, rather expensive system.
Your requirements are also quite complex, and not likely to be out-of-the-box. I would be thinking of a hybrid system. Store the document metadata in a relational database system, so you can quickly access them with simple queries. You can store the documents themselves in the database as blobs.
Some of your requirements can be met with text add-ins on relational databases. So, simple searching is feasible using inverted index technology. That handles the first of your three scenarios.
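For readers unfamiliar with the term, an inverted index maps each word to the set of documents containing it. The following is a toy sketch under simplifying assumptions (lowercased alphanumeric tokens, everything in memory); a real text add-in would also handle stemming, ranking and persistence.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


def search_all(index, terms):
    """Return ids of documents that contain every one of the terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "The sergeant wrote a report",
    2: "A report on budgets",
    3: "Sergeant and colonel met",
}
index = build_inverted_index(docs)
print(search_all(index, ["sergeant", "report"]))
```

A search only touches the posting sets for the query terms instead of walking every document.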
The other two are much more challenging. The third ("predict an author") can probably be handled by having a parallel system that stores author information, summarized from the documents when they are loaded. Then it is a question of comparing a document to the author, using simple statistical analysis (naive Bayesian, anyone?).
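As a rough sketch of the "summarize at load time, compare later" idea: per-author word counts are accumulated when documents are loaded, and a new document is scored against each author with multinomial naive Bayes and add-one smoothing. The `AuthorProfiles` class and whitespace tokenizer are assumptions made for illustration; real author attribution would need far richer features than word counts.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class AuthorProfiles:
    """Per-author word counts, updated as documents are loaded."""

    def __init__(self):
        self.word_counts = {}        # author -> Counter of words
        self.doc_counts = Counter()  # author -> number of documents
        self.vocab = set()

    def add_document(self, author, text):
        tokens = tokenize(text)
        self.word_counts.setdefault(author, Counter()).update(tokens)
        self.doc_counts[author] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        """Return the author with the highest naive Bayes log score."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_author, best_score = None, -math.inf
        for author, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[author] / total_docs)
            for token in tokens:
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log(
                    (counts[token] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_author, best_score = author, score
        return best_author


profiles = AuthorProfiles()
profiles.add_document("alice", "the sea the ship the storm")
profiles.add_document("bob", "market prices rose sharply today")
print(profiles.predict("the storm at sea"))
```

The point of the design is that prediction reads only the small per-author summaries, never the full document collection.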
The middle one is tricky, but it suggests yet another component for managing comments on documents. Depending on the volume, this may be easy or hard.
Finally, how changing are the requirements? Do you really know what the system should be doing? Or will the functionality be radically different once you get it up and running?

Database Structure for hierarchical data with horizontal slices

We're currently looking at trying to improve performance of queries for our site. The core hierarchical data structure has 5 levels, and each type has about 20 fields.
level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)
We have a single data pipeline which performs all of these updates and inserts (ie. we have full control over data going in).
The queries we need to do on this are:
fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children
We read from this datasource a lot, as in hundreds of times a second. All of the queries we need to perform are known and optimised as well as they can be to the current data structure.
We're currently using MySQL queries behind memcached for this, and just doing additional queries to get children/parents. I'm thinking that some sort of tree-based or document-based database might be more suitable.
My question is: what's the best way to model this data for efficient read performance?
Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.
I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.
OLAP servers:
Combining relational and OLAP:
*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.
If I've misunderstood the issue you're having, then by all means please ignore this answer :\
UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).
ODBMS info: http://en.wikipedia.org/wiki/Object_database
InterSystems Caché is one Object Database I know of that sounds like a more appropriate fit based on what you've said.
If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).
In my case, I know on MS SQL that a switch we did from having some core queries use a VARCHAR field to using an INTEGER field made a huge difference in speed. Text data is one of THE MOST expensive types of data to process. So for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.
An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.
The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
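To make the name-table idea concrete, here is a small sketch. It uses SQLite through Python's sqlite3 module purely as a stand-in (the experience above was on MS SQL), the table and column names are invented, and it only shows the schema shape, not the speed difference, which you would still have to measure on your own data.

```python
import sqlite3


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE last_names (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE people (
            id           INTEGER PRIMARY KEY,
            first_name   TEXT NOT NULL,
            last_name_id INTEGER NOT NULL REFERENCES last_names(id)
        );
        CREATE INDEX idx_people_last_name_id ON people(last_name_id);
    """)
    return conn


def last_name_id(conn, name):
    """Return the integer id for a last name, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO last_names (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT id FROM last_names WHERE name = ?", (name,)).fetchone()
    return row[0]


def add_person(conn, first_name, last_name):
    conn.execute(
        "INSERT INTO people (first_name, last_name_id) VALUES (?, ?)",
        (first_name, last_name_id(conn, last_name)))


def people_with_last_name(conn, last_name):
    # The text comparison happens once, against the small lookup table;
    # the join itself is on integer ids.
    cur = conn.execute(
        """SELECT p.first_name
           FROM people p
           JOIN last_names n ON n.id = p.last_name_id
           WHERE n.name = ?
           ORDER BY p.first_name""",
        (last_name,))
    return [row[0] for row in cur]


conn = build_demo_db()
for first, last in [("Ann", "Smith"), ("Bob", "Smith"), ("Cy", "Jones")]:
    add_person(conn, first, last)
print(people_with_last_name(conn, "Smith"))
```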
And of course, in general you should be sure to have appropriate indexes on your data.
Hope that helps.
A Document/Tree based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design? I fail to see any. Querying one level up and down doesn't count: it is a simple join. Please keep in mind that by going the "Document/Tree based database" route you would compromise your general querying ability. To summarize, just hire a competent db specialist who would analyze your performance bottlenecks -- they are usually cured with a mundane index addition.
there's not really enough info here to say much useful - you'd need to measure things, look at "explains", etc - but one option that goes beyond the usual indexing would be to shard by level 3 instances. that would give you better performance on parallel queries that hit different shards, at its simplest (separate disks), or you could use separate machines if you want to throw more resources at each shard.
the only reason i mention this really is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted - i have no idea what tools mysql has for this).
and if your data volume isn't so high then with sharding you might be able to get it down to ssds...
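A minimal sketch of application-layer sharding by level 3 instance, assuming integer level 3 ids and a fixed number of shards (both assumptions, as is the simple modulo placement rule):

```python
from collections import defaultdict


def shard_for(level3_id, shard_count):
    """Pick a shard for a level 3 item; its children live on the same shard."""
    return level3_id % shard_count


def group_by_shard(level3_ids, shard_count):
    """Split a multi-item fetch into one batch per shard."""
    groups = defaultdict(list)
    for level3_id in level3_ids:
        groups[shard_for(level3_id, shard_count)].append(level3_id)
    return dict(groups)


# Each batch could be sent to its shard in parallel.
print(group_by_shard([1, 2, 5, 6, 8], shard_count=4))
```

Keeping a level 3 item and all of its children on one shard means "fetch level3 and all children" never has to cross shards.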

Fast read-only embedded "database"?

I'm looking to distribute some information to different machines for efficient and extremely fast access without any network overhead. The data exists in a relational schema, and it is a requirement to "join" on relations between entities, but it is not a requirement to write to the database at all (it will be generated offline).
I had a lot of confidence that SQLite would deliver on performance, but an RDBMS seems to be unsuitable at a fundamental level: joins are very expensive due to the cost of index lookups, and in my read-only context they are an unnecessary overhead, since entities could store direct references to each other in the form of file offsets. In this way, an index lookup is exchanged for a file seek.
What are my options here? Database doesn't really seem to describe what I'm looking for. I'm aware of Neo4j, but I can't embed Java in my app.
Edit, to answer the comments:
The data will be up to 1gb in size, and I'm using PHP so keeping the data in memory is not really an option. I will rely on the OS buffer cache to avoid continually going to disk.
An example would be a Product table with 15 fields of mixed types, and a query to list products with a certain make, joining on a Category table.
The solution will have to be some kind of flat file. I'm wondering if there already exists some software that meets my needs.
@Mark Wilkins:
The performance problem is measured. Essentially, it is unacceptable in my situation to replace a 2ms IO bound query to Memcache with a 5ms CPU bound call to SQLite... For example, the categories table has 500 records, containing parent and child categories. The following query takes ~8ms, with no disk IO: SELECT 1 FROM categories a INNER JOIN categories b ON b.id = a.parent_id. Some simpler, join-less queries are very fast.
I may not be completely clear on your goals as to the types of queries you are needing. But the part about storing file offsets to other data seems like it would be a very brittle solution that is hard to maintain and debug. There might be some tool that would help with it, but my suspicion is that you would end up writing most of it yourself. If someone else had to come along later and debug and figure out a homegrown file format, it would be more work.
However, my first thought is to wonder if the described performance problem is estimated at this point or actually measured. Have you run the tests with the data in a relational format to see how fast it actually is? It is true that a join will almost always involve more file reads (do the binary search as you mentioned and then get the associated record information and then lookup that record). This could take 4 or 5 or more disk operations ... at first. But in the categories table (from the OP), it could end up cached if it is commonly hit. This is a complete guess on my part, but in many situations the number of categories is relatively small. If that is the case here, the entire category table and its index may stay cached in memory by the OS and thus result in very fast joins.
If the performance is indeed a real problem, another possibility might be to denormalize the data. In the categories example, just duplicate the category value/name and store it with each product record. The database size will grow as a result, but you could still use an embedded database (there are a number of possibilities). If done judiciously, it could still be maintained reasonably well and provide the ability to read full object with one lookup/seek and one read.
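Since the data is generated offline anyway, the denormalization can be done as a build step. A small sketch, with invented field names, of copying the category name into each product record:

```python
def denormalize(products, categories):
    """Copy the category name into each product so reads need no join."""
    names = {category["id"]: category["name"] for category in categories}
    return [
        {**product, "category_name": names[product["category_id"]]}
        for product in products
    ]


categories = [{"id": 1, "name": "Cameras"}, {"id": 2, "name": "Lenses"}]
products = [
    {"id": 10, "make": "Acme", "category_id": 1},
    {"id": 11, "make": "Acme", "category_id": 2},
]
for row in denormalize(products, categories):
    print(row)
```

The duplicated names cost space, but a product can then be read in full with a single lookup.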
In general, probably the fastest thing you can do at first is to denormalize your data, thus avoiding JOINs and other multi-table lookups.
Using SQLite you can certainly customize all sorts of things and tailor them to your needs. For example, disable all mutexing if you're only accessing via one thread, up the memory cache size, customize indexes (including getting rid of many), custom build to disable unnecessary meta data, debugging, etc.
Take a look at the following:
PRAGMA Statements: http://www.sqlite.org/pragma.html
Custom Builds of SQLite: http://www.sqlite.org/custombuild.html
SQLite Query Planner: http://www.sqlite.org/optoverview.html
SQLite Compile Options: http://www.sqlite.org/compile.html
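As an illustration of the run-time side of this tuning, the sketch below opens a database read-only and raises the page cache. It is written in Python for brevity (the question uses PHP, but the PRAGMA statements are the same SQL text). The `open_read_only` and `build_offline` helpers, the 64 MiB cache size and the tiny categories table are assumptions for the example; compile-time options such as disabling mutexes are not shown.

```python
import os
import sqlite3
import tempfile


def build_offline(path):
    """Stand-in for the offline generation step."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE categories (id INTEGER PRIMARY KEY, parent_id INTEGER)")
    conn.executemany(
        "INSERT INTO categories VALUES (?, ?)", [(1, None), (2, 1), (3, 1)])
    conn.commit()
    conn.close()


def open_read_only(path, cache_kib=65536):
    """Open the database read-only with a larger page cache."""
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    # A negative cache_size is interpreted as a size in KiB, not pages.
    conn.execute(f"PRAGMA cache_size = -{int(cache_kib)}")
    # Keep temporary tables and indices in memory.
    conn.execute("PRAGMA temp_store = MEMORY")
    return conn


def demo():
    """Return (join_row_count, write_was_rejected) for a throwaway database."""
    path = os.path.join(tempfile.mkdtemp(), "catalog.db")
    build_offline(path)
    conn = open_read_only(path)
    count = conn.execute(
        "SELECT COUNT(*) FROM categories a "
        "INNER JOIN categories b ON b.id = a.parent_id").fetchone()[0]
    try:
        conn.execute("INSERT INTO categories VALUES (4, 1)")
        rejected = False
    except sqlite3.OperationalError:
        rejected = True
    conn.close()
    return count, rejected


print(demo())
```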
This is all of course assuming a database is what you need.

Hardware specialized for bitmap indexes?

This is just an out of curiosity question. Let's say you have a database table with 1m rows in it, and you want to often do queries like looking for either male or female, US or non-US, voter or non-voter etc, it's clearly very efficient to define a bitmap index for the table in which each bit represents one either-or condition.
However, to execute the query, you still have to scan through (probably) all of the index doing a bitand to select matching rows.
My question is: is there some kind of bitmap-optimized storage such that the bit 'channels' are pre-created in the hardware? I'm envisaging something similar to knitting needles lifting punched cards out of an old library catalog system. In other words, rather than going row by row through memory locations, the chip can just pull out the matching rows electronically because there are hardware connections for each bit channel? I've a feeling the brain must work something like this. If I think of 'all blue objects', and then restrict that to 'all long blue objects' and then 'all long blue heavy objects', my brain does it effortlessly and I'm sure it's not scanning through all the objects I know about every time. It seems like perhaps there are some neurons that provide pathways for different dimensions for quick retrieval. I'm just wondering if there's anything like this in the hardware world?
Why invent something that's already there?
Content-addressable memory
You could certainly wire up some logic to perform this (e.g. using programmable logic devices) but you'll need a large number of logic elements and connections, making such circuits probably expensive to build for large databases.
For example, one would have to build matching logic (is this bit being selected on ? what is the required value ?) into each 'row' giving you one signal (selected/not selected) per row.
You would then have a logic circuit with one million output lines (telling you which records were selected) which you probably at some point have to 'serialize' anyway, e.g. when you interface with the PCI bus inside a computer (i.e. first transmit the result for record 0 then 1 etc. or transmit the numbers of the selected records).
As bitwise operations in modern CPUs are fast (logical operations such as bitwise 'and', 'or' and 'xor' should only take one clock cycle) you're probably not gaining much using such a custom circuit compared to optimized software (not to mention the 'hardware' development and testing effort) unless you have a very special use case.
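For comparison, here is what the software version looks like in miniature. Each either-or condition gets one bitmap with bit i set when row i matches, and a combined query is a bitwise AND of whole bitmaps. Python integers are used as arbitrary-length bitmaps purely for illustration; the helper names and sample rows are made up.

```python
def build_bitmap(rows, predicate):
    """Return an int whose bit i is set when predicate(rows[i]) is true."""
    bitmap = 0
    for i, row in enumerate(rows):
        if predicate(row):
            bitmap |= 1 << i
    return bitmap


def matching_rows(bitmap):
    """List the row positions whose bit is set."""
    positions = []
    position = 0
    while bitmap:
        if bitmap & 1:
            positions.append(position)
        bitmap >>= 1
        position += 1
    return positions


rows = [
    {"sex": "F", "us": True, "voter": True},
    {"sex": "M", "us": True, "voter": False},
    {"sex": "F", "us": False, "voter": True},
    {"sex": "F", "us": True, "voter": True},
]
female = build_bitmap(rows, lambda r: r["sex"] == "F")
us = build_bitmap(rows, lambda r: r["us"])
voter = build_bitmap(rows, lambda r: r["voter"])

# One AND per condition combines all rows at once, a machine word at a time.
print(matching_rows(female & us & voter))
```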

Storing Signals in a Database

I'm designing an application that receives information from roughly 100k sensors that measure time-series data. Each sensor measures a single integer data point once every 15 minutes, saves a log of these values, and sends that log to my application once every 4 hours. My application should maintain about 5 years of historical data. The packet I receive once every 4 hours is of the following structure:
Date and time of the sequence start
Number of samples to arrive (assume this is fixed for the sake of simplicity, although in practice there may be partials)
The sequence of samples, each of exactly 4 bytes
My application's main usage scenario is showing graphs of composite signals at certain dates. When I say "composite" signals I mean that for example I need to show the result of adding Sensor A's signal to Sensor B's signal and subtracting Sensor C's signal.
My dilemma is how to store this time-series data in my database. I see two options, assuming I use a relational database:
Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
Store every 4-hour signal as a separate row with its starting time. In this case, whenever a signal arrives, I just add it as a BLOB to the database.
There are obvious pros and cons for each of the options, including storage size, performance, and complexity of the code "above" the database.
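To make the two options concrete, here is a small sketch of what each implies in code. It assumes little-endian signed 32-bit samples (the byte order and signedness are not stated above) and uses invented helper names.

```python
import struct
from datetime import datetime, timedelta

SAMPLE_INTERVAL = timedelta(minutes=15)


def pack_samples(samples):
    """Option 2: one BLOB per signal, 4 bytes per sample."""
    return struct.pack(f"<{len(samples)}i", *samples)


def unpack_samples(blob):
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}i", blob))


def to_rows(sensor_id, start, samples):
    """Option 1: one (sensor, timestamp, value) row per sample."""
    return [
        (sensor_id, start + i * SAMPLE_INTERVAL, value)
        for i, value in enumerate(samples)
    ]


def composite(a, b, c):
    """Sensor A plus sensor B minus sensor C, sample by sample."""
    return [x + y - z for x, y, z in zip(a, b, c)]


start = datetime(2024, 1, 1, 0, 0)
signal_a = [10, 11, 12, 13]
blob = pack_samples(signal_a)
print(len(blob), unpack_samples(blob))
print(to_rows("A", start, signal_a)[1])
print(composite(signal_a, [1, 1, 1, 1], [5, 5, 5, 5]))
```

With BLOBs, the unpacking and timestamp alignment live in application code; with rows, the database can do the alignment through the timestamp column.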
I wondered if there are best practices for such cases.
Many thanks.
Storing each sample in its own row sounds simple and logical to me. Don't be too hasty to optimize unless there is actually a good reason for it. Maybe you should do some tests with dummy data to see if any optimization is really necessary.
I think storing the data in the form that makes it easiest to carry out your main goal is likely the least painful overall. In this case, it's likely the more efficient as well.
Since your main goal appears to be to display the information in interesting and flexible ways I'd go with separate rows for each data point. I presume most of the effort required to write this program well is likely on the display side, you should minimize the complexity on that side as much as possible.
Storing data in BLOBs is good if the content isn't relevant and you would never want to run queries against it. In this case, your data will be the contents of the database, and therefore very relevant.
I think you should:
1. Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
I see two database operations here: the first is to store the data as it comes in, and the second is to retrieve the data in a (potentially large) number of ways.
As Kieveli says, since you'll be using discrete parts of the data (as opposed to all of the data all at once), storing it as a blob won't help you when it comes time to read it. So for the first task, storing the data line by line would be optimal.
This might also be "good enough" when querying the data. However, performance may become an issue if you get massive amounts of volume [100,000 sensors x 4 samples per hour x 24 hours = 9,600,000 rows per day, x 1,826 days in 5 years = 17,529,600,000 or so rows in five years]. To my mind, if you want to write flexible queries against that kind of data, you'll want some form of star schema structure (as gets used in data warehouses).
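The volume estimate can be reproduced with a few lines of arithmetic. The only assumptions are one leap day in the five years and counting just the 4-byte sample payload, with no timestamps, keys or index overhead.

```python
SENSORS = 100_000
SAMPLES_PER_DAY = 24 * 60 // 15       # one sample every 15 minutes -> 96
DAYS_IN_FIVE_YEARS = 5 * 365 + 1      # assumes exactly one leap day
BYTES_PER_SAMPLE = 4

rows_per_day = SENSORS * SAMPLES_PER_DAY             # 9,600,000
total_rows = rows_per_day * DAYS_IN_FIVE_YEARS       # 17,529,600,000
raw_payload_bytes = total_rows * BYTES_PER_SAMPLE    # 70,118,400,000

print(f"{rows_per_day:,} rows per day")
print(f"{total_rows:,} rows in five years")
print(f"{raw_payload_bytes / 1e9:.1f} GB of raw sample values")
```

Per-row overhead in a real table will be several times the raw 4 bytes, so the on-disk size for the row-per-sample option would be well above this figure.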
Whether you load the data directly into the warehouse, or let it build up "row by row" to be added to the warehouse every day/week/month/whatever, depends on time, effort, available resources, and so on.
A final suggestion: when you set up a test environment for your new code, load it with several years of (dummy) data, to see how it will perform.
