Graph database vs. RDB with link/bridge tables

I work in the fraud/AML (anti-money laundering) field, and we are exploring using a graph database to unearth hidden connections and links. I've read a fair amount about graph databases lately (mostly Neo4j, but I think the concepts are similar across different products?), and from what I can tell, they seem to be well-suited to this domain. The issue is that I'm having a hard time getting buy-in from tech management, as they seem to think that we can do the same things with our existing data reporting model, which is in Hadoop and is essentially a data warehouse with specific tables that provide many-to-many link tables between the core tables (I believe Kimball calls them 'bridge' tables?).
In a way, they seem to provide the same functionality as the relationship tables in a graph DB. Given that we have already constructed the link tables in Hadoop, would a graph database provide any performance advantage for the kinds of things we may want to do (e.g. how is Customer A connected to Customer B?), or have we largely negated any performance advantage of a graph DB by building all of the link tables?

On similar hardware platforms, a relational database will never be able to keep up with a well-constructed graph database when performing "path-between" queries. Never.
Every graph database product has its own internal storage representation, but they are all fundamentally designed to store nodes and edges and to support navigational queries across those nodes and edges. Without the addition of new graph-support features, relational databases will struggle to provide graph-like capabilities.
The other advantage of using a native graph database is that the graph query languages are specifically designed to support path-between queries. In Objectivity/DB, a massively scalable and distributable object/graph database, we can use the DO query language to find all of the paths between two entities up to a specified number of degrees apart in milliseconds or seconds. A DO query might look like the following:
Match p = (:Account {accountId = "1234"})
          -[*..100]->
          (:Account {accountId = "5678"})
return p;
Here, we are saying: Find all paths (p) from Account 1234 to Account 5678, where they are between 1 and 100 degrees apart.
To create and execute this same query in a relational database (without the addition of graph features to the database) would be much more complicated, and executing it would be much more resource-intensive (memory, CPU, I/O).
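For a concrete sense of what that extra complexity looks like, here is a minimal, hedged sketch of a bounded path search written as a recursive SQL CTE and run through Python's sqlite3; the transfers table and account IDs are illustrative, not from the original post. Note how the query has to carry its own visited-path bookkeeping, which a graph engine handles natively.

import sqlite3

# Illustrative schema: an edge list of account-to-account links.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transfers (src TEXT, dst TEXT);
INSERT INTO transfers VALUES
    ('1234', 'A1'), ('A1', 'A2'), ('A2', '5678'), ('1234', '5678');
""")

# All paths from account 1234 to 5678, up to 100 hops.
rows = conn.execute("""
WITH RECURSIVE paths(node, path, depth) AS (
    SELECT '1234', '1234', 0
    UNION ALL
    SELECT t.dst, p.path || '->' || t.dst, p.depth + 1
    FROM paths p JOIN transfers t ON t.src = p.node
    WHERE p.depth < 100
      AND instr(p.path, t.dst) = 0  -- crude cycle check
)
SELECT path FROM paths WHERE node = '5678';
""").fetchall()

for (path,) in rows:
    print(path)  # 1234->5678 and 1234->A1->A2->5678

Every recursive step here is a join against the whole transfers table, which is where the memory, CPU, and I/O costs mentioned above come from.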
If you have the opportunity to explore graph databases for your project, make sure you understand your scalability and data distribution requirements. That information will be key to selecting the correct product.
Disclaimer: I am the Director of Field Operations for Objectivity.

Related

Best database and database design for this particular need?

I'm looking to store around 50-100 million documents in a database and be able to do queries at a very fast speed. A document would look something like this:
{
name: 'example',
value: '300,201,512'
}
The value column is always unique, name is not.
Now I want to be able to check, with a single query, whether a document with a specific value exists. What database would be the best choice, and which design would best achieve the fastest speed for a query like that?
NoSQL databases try to offer certain functionality that more traditional relational database management systems do not. Whether it is holding simple key-value pairs for short periods of time for caching purposes, or keeping unstructured collections of data that could not be easily dealt with using relational databases and the structured query language (SQL) - they are here to help.
In order to better understand the roles and underlying technology of each database management system, let's quickly go over these four operational models.
Key / Value Based
We will begin our NoSQL modeling journey with key / value based database management systems, simply because they can be considered the most basic and backbone implementation of NoSQL.
These types of databases work by matching keys with values, similar to a dictionary. There is no structure and no relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42), which can later be retrieved the same way by supplying the key.
Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic information after performing, for example, a CPU- and memory-intensive computation. They are extremely performant, efficient and usually easily scalable.
Note: When it comes to computers, a dictionary usually refers to a special sort of data object: a collection of individual keys matched to values.
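As a minimal sketch of this pattern, assuming a Redis server on localhost and the redis-py client, the key (the_answer_to_life) and value (42) from the example above would be stored and fetched like this:

import redis  # pip install redis; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379)

# Store a value under a key, optionally with a TTL for cache-style use.
r.set("the_answer_to_life", 42, ex=3600)  # expires in one hour

# Retrieve it the same way: by key. No schema, no relations, no query
# language -- just the key.
print(r.get("the_answer_to_life"))  # b'42' (values come back as bytes)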
Column Based
Column based NoSQL database management systems work by advancing the simple nature of key / value based ones.
Despite their complicated-to-understand image on the internet, these databases work very simply by creating collections of one or more key / value pairs that match a record.
Unlike the traditional, predefined schemas of relational databases, column-based NoSQL solutions do not require a pre-structured table to work with the data. Each record comes with one or more columns containing the information, and each column of each record can be different.
Basically, column-based NoSQL databases are two-dimensional arrays whereby each key (i.e. row / record) has one or more key / value pairs attached to it, and these management systems allow very large and unstructured data to be kept and used (e.g. a record with tons of information).
These databases are commonly used when simple key / value pairs are not enough, and storing very large numbers of records with large amounts of information is a must. DBMSs implementing column-based, schema-less models can scale extremely well.
Document Based
Document based NoSQL database management systems can be considered the latest craze that managed to take a lot of people by storm. These DBMSs work in a similar fashion to column-based ones; however, they allow much deeper nesting and more complex structures to be achieved (e.g. a document, within a document, within a document).
Documents overcome the constraints of one or two levels of key / value nesting of columnar databases. Basically, any complex and arbitrary structure can form a document, which can be stored using these management systems.
Despite their powerful nature, and the ability to query records by individual keys, document based management systems have their own issues and downfalls compared to others. For example, retrieving a value from a record means fetching the whole record, and the same goes for updates, both of which hurt performance.
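A short, hedged sketch of the document model with pymongo (the database, collection, and field names are illustrative, and a MongoDB server on localhost is assumed):

from pymongo import MongoClient  # pip install pymongo

orders = MongoClient("mongodb://localhost:27017").shop.orders

# A single document can nest arbitrarily deep structures.
orders.insert_one({
    "customer": {"name": "Ann", "address": {"city": "Oslo"}},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
})

# Querying by an individual nested key uses dot notation...
doc = orders.find_one({"customer.address.city": "Oslo"})

# ...but by default the whole document comes back -- the retrieval
# caveat mentioned above.
print(doc["items"])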
Graph Based
Finally, a very interesting flavour of NoSQL database management systems is the graph-based one.
The graph-based DBMS models represent the data in a completely different way than the previous three models. They use graph structures with nodes and edges connecting each other through relations.
As in mathematics, certain operations are much simpler to perform using this type of model, thanks to its nature of linking and grouping related pieces of information (e.g. connected people).
These databases are commonly used by applications in which connections between entities need to be established and traversed. For example, when you register with a social network of any sort, your friends' connection to you and their friends' friends' relation to you are much easier to work with using graph-based database management systems.
Fastest document-based DBs:
1) MongoDB
2) DynamoDB
Here is the difference, for your reference.
I will give preference to DynamoDB
Currently we are working on an AWS data lake; performance is really fast. We store data in S3 and get it back via Athena.
If you want to import data into some database, then try MS SQL Server 2008 R2, because it is very user-friendly and allows you to do your work more accurately and precisely. If you want to do that without any cost, then MySQL is a better option (a good MySQL editor is SQLyog). I hope this is beneficial for you.
Short Answer:
I think 100 million documents, in the structure and conditions you mention, is not BIG ENOUGH to require NoSQL. You can handle them with PostgreSQL, MySQL, etc.
Note that Wikipedia used MySQL for a long time (though not anymore); see the reference.
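As a sketch of that short answer: a plain relational table with a unique index on value turns the existence check into a single index lookup. Shown below with Python's sqlite3 for brevity; the same schema works in PostgreSQL or MySQL.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT, value TEXT)")
# The unique index turns the existence check into a single B-tree
# lookup, even over 100 million rows.
conn.execute("CREATE UNIQUE INDEX idx_docs_value ON docs(value)")
conn.execute("INSERT INTO docs VALUES ('example', '300,201,512')")

exists = conn.execute(
    "SELECT EXISTS(SELECT 1 FROM docs WHERE value = ?)",
    ("300,201,512",),
).fetchone()[0]
print(bool(exists))  # True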

Do graph databases deprecate relational databases?

I'm new to DBs of any kind. It seems you can represent any relational database in graph form (although it might be a very flat graph), and any graph database in a relational database (with enough tables).
A graph can avoid a lot of lookups in other tables by having a hard link from one entry to another, so in many/most cases I can see the speed advantage of a graph. If your data is naturally hierarchical, and especially if it forms a tree, I see the logical/reasoning benefit of a graph over relational. I imagine a node of a graph which links to other nodes probably contains multiple maps or lists... which effectively embeds a relational DB within the nodes of a graph.
Are there any disadvantages to a graph DB vs a relational DB? (Note: I'm not looking at things like missing features in implementations, but instead at the theoretical pros/cons.)
When should I still use a relational database? Even if I logically have a single mapping of an int to an int, I could do it in a graph.
Graph databases were deprecated by relational-ish technology some 20 to 30 years ago.
The major theoretical disadvantage is that graph databases use TWO basic concepts to represent information (nodes and edges), whereas a relational database uses only one (the relation). This bleeds over into the language for data manipulation, in that a graph-based language must provide two distinct sets of operators: one for operating on nodes, and one for operating on edges. The relational model can suffice with only one.
More operators means more operators to implement for the DBMS builder, more opportunity for bugs, and for the user it means more distinct language constructs to learn. For example, adding information to a database is just INSERT in relational, in graph-based it can be either STORE (nodes) or CONNECT (edges). Removing information is just DELETE (relational), as opposed to either ERASE (nodes) or DISCONNECT (edges).
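To make that concrete, here is a small sketch (using Python's sqlite3; the table names are illustrative) of how the relational model represents both nodes and edges as plain relations, so the same INSERT and DELETE operators cover both:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Nodes and edges are both just relations...
CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE edges (src INTEGER, dst INTEGER, kind TEXT);
""")

# ...so one operator, INSERT, adds either kind of information,
conn.execute("INSERT INTO nodes VALUES (1, 'Alice')")
conn.execute("INSERT INTO nodes VALUES (2, 'Bob')")
conn.execute("INSERT INTO edges VALUES (1, 2, 'KNOWS')")

# ...and one operator, DELETE, removes either kind.
conn.execute("DELETE FROM edges WHERE src = 1 AND dst = 2")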
Building on Erwin Smout's fine answer, an important reason why the relational model supplanted the graph one is that a graph has a greater degree of "bias" baked into its structure than relations do. The edges of a graph are navigational links which user queries are expected to traverse in a particular way. A relational model of the same data assumes much less about how the data will be used. Users are free to join and manipulate relational data in ways that the database designer might not have foreseen. The disruptive costs of re-engineering graph database structures to support new requirements were a factor which drove the adoption of the relational model and its SQL-based offshoots in the 1980s.
Relational databases were designed to aggregate data; graph databases, to find relations.
If you have, for example, a financial domain, all connections are known; you only aggregate data by other data to find sums and so on.
Graph databases are better in more chaotic domains where the connections are more important, and not all connections are apparent, for example:
networks of people, with different relations with one another
films and people creating them. Not just actors but the whole crew.
natural language processing and finding connections between recognized words
The data model is important, but what matters more is how you access your data. Notice that there are very few (none, actually) sharded or otherwise distributed graph databases out there. If you compare insertion speed into a typical relational database and a graph database, your relational database will most likely win.
Yes, the graph model is more versatile than the relational model, but that doesn't make it universal - in some cases, this versatility is a roadblock for optimizations.
In fact, modern graph databases are niche solutions for a narrow set of tasks - finding a route from A to B, working with friends in a social network, information technology in medicine.
For most business applications relational databases continue to prevail.
I'm missing the performance aspect in the answers above.
The performance of graph-based databases is inherently worse for scalar and maybe even tree-based models. Only if you have a real graph may they exhibit better performance.
Also, most graph DBs do not offer the ACID support that almost any RDBMS provides.
From my real-life experience, almost any evolving data model will sooner or later become a graph, and that's why graph DBs are superior in terms of flexibility and agility (they keep pace with the evolution of your data model).
That's why I don't think that RDBs will prevail "for most business applications" as @Kostja says. I think they will prevail where ACID capability is essential.

Practical example for each type of database (real cases) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
There are several types of databases for different purposes; however, MySQL is normally used for everything, because it is the best-known database. Just to give an example, in my company a big-data application has a MySQL database at an initial stage, which is unbelievable and will bring serious consequences to the company. Why MySQL? Just because no one knows how (and when) another DBMS should be used.
So, my question is not about vendors, but about types of databases. Can you give me a practical example of specific situations (or apps) for each type of database where it is highly recommended?
Example:
• A social network should use the type X because of Y.
• MongoDB or CouchDB can't support transactions, so a document DB is not good for an app for a bank or an auction site.
And so on...
Relational: MySQL, PostgreSQL, SQLite, Firebird, MariaDB, Oracle DB, SQL Server, IBM DB2, IBM Informix, Teradata
Object: ZODB, db4o, Eloquera, Versant, Objectivity/DB, VelocityDB
Graph databases: AllegroGraph, Neo4j, OrientDB, InfiniteGraph, GraphBase, SparkleDB, FlockDB, BrightstarDB
Key-value stores: Amazon DynamoDB, Redis, Riak, Voldemort, FoundationDB, LevelDB, BangDB, KAI, hamsterdb, Tarantool, Maxtable, HyperDex, Genomu, MemcacheDB
Column family: Bigtable, HBase, Hypertable, Cassandra, Apache Accumulo
RDF stores: Apache Jena, Sesame
Multimodel databases: ArangoDB, Datomic, OrientDB, FatDB, AlchemyDB
Document: MongoDB, CouchDB, RethinkDB, RavenDB, Terrastore, JasDB, RaptorDB, djondb, EJDB, DensoDB, Couchbase
XML databases: BaseX, Sedna, eXist
Hierarchical: InterSystems Caché, GT.M (thanks to @Laurent Parenteau)
I found two impressive articles about this subject. All credits to highscalability.com. The information in this answer is transcribed from these articles:
35+ Use Cases For Choosing Your Next NoSQL Database
What The Heck Are You Actually Using NoSQL For?
If Your Application Needs...
• complex transactions because you can't afford to lose data or if you would like a simple transaction programming model then look at a Relational or Grid database.
• Example: an inventory system that might want full ACID. I was very unhappy when I bought a product and they said later they were out of stock. I did not want a compensated transaction. I wanted my item!
• to scale then NoSQL or SQL can work. Look for systems that support scale-out, partitioning, live addition and removal of machines, load balancing, automatic sharding and rebalancing, and fault tolerance.
• to always be able to write to a database because you need high availability then look at Bigtable Clones which feature eventual consistency.
• to handle lots of small continuous reads and writes, that may be volatile, then look at Document or Key-value or databases offering fast in-memory access. Also, consider SSD.
• to implement social network operations then you first may want a Graph database or second, a database like Riak that supports relationships. An in-memory relational database with simple SQL joins might suffice for small data sets. Redis' set and list operations could work too.
• to operate over a wide variety of access patterns and data types then look at a Document database, they generally are flexible and perform well.
• powerful offline reporting with large datasets then look at Hadoop first and second, products that support MapReduce. Supporting MapReduce isn't the same as being good at it.
• to span multiple data-centers then look at Bigtable Clones and other products that offer a distributed option that can handle the long latencies and are partition tolerant.
• to build CRUD apps then look at a Document database, they make it easy to access complex data without joins.
• built-in search then look at Riak.
• to operate on data structures like lists, sets, queues, publish-subscribe then look at Redis. Useful for distributed locking, capped logs, and a lot more.
• programmer friendliness in the form of programmer-friendly data types like JSON, HTTP, REST, Javascript then first look at Document databases and then Key-value Databases.
• transactions combined with materialized views for real-time data feeds then look at VoltDB. Great for data-rollups and time windowing.
• enterprise-level support and SLAs then look for a product that makes a point of catering to that market. Membase is an example.
• to log continuous streams of data that may have no consistency guarantees necessary at all then look at Bigtable Clones because they generally work on distributed file systems that can handle a lot of writes.
• to be as simple as possible to operate then look for a hosted or PaaS solution because they will do all the work for you.
• to be sold to enterprise customers then consider a Relational Database because they are used to relational technology.
• to dynamically build relationships between objects that have dynamic properties then consider a Graph Database because often they will not require a schema and models can be built incrementally through programming.
• to support large media then look at storage services like S3. NoSQL systems tend not to handle large BLOBs, though MongoDB has a file service.
• to bulk upload lots of data quickly and efficiently then look for a product that supports that scenario. Most will not because they don't support bulk operations.
• an easier upgrade path then use a fluid schema system like a Document Database or a Key-value Database because it supports optional fields, adding fields, and field deletions without the need to build an entire schema migration framework.
• to implement integrity constraints then pick a database that supports SQL DDL, implement them in stored procedures, or implement them in application code.
• a very deep join depth then use a Graph Database because they support blisteringly fast navigation between entities.
• to move behavior close to the data so the data doesn't have to be moved over the network then look at stored procedures of one kind or another. These can be found in Relational, Grid, Document, and even Key-value databases.
• to cache or store BLOB data then look at a Key-value store. Caching can be for bits of web pages, or to save complex objects that were expensive to join in a relational database, to reduce latency, and so on.
• a proven track record like not corrupting data and just generally working then pick an established product and when you hit scaling (or other issues) use one of the common workarounds (scale-up, tuning, memcached, sharding, denormalization, etc).
• fluid data types because your data isn't tabular in nature, or requires a flexible number of columns, or has a complex structure, or varies by user (or whatever), then look at Document, Key-value, and Bigtable Clone databases. Each has a lot of flexibility in their data types.
• other business units to run quick relational queries so you don't have to reimplement everything then use a database that supports SQL.
• to operate in the cloud and automatically take full advantage of cloud features then we may not be there yet.
• support for secondary indexes so you can look up data by different keys then look at relational databases and Cassandra's new secondary index support.
• create an ever-growing set of data (really BigData) that rarely gets accessed then look at Bigtable Clone which will spread the data over a distributed file system.
• to integrate with other services then check if the database provides some sort of write-behind syncing feature so you can capture database changes and feed them into other systems to ensure consistency.
• fault tolerance then check how durable writes are in the face of power failures, partitions, and other failure scenarios.
• to push the technological envelope in a direction nobody seems to be going then build it yourself because that's what it takes to be great sometimes.
• to work on a mobile platform then look at CouchDB or Mobile Couchbase.
General Use Cases (NoSQL)
• Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space.
• Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month (in 2010). Twitter, for example, has the problem of storing 7 TB of data per day (in 2010), with the prospect of this requirement doubling multiple times per year. This is the "data too big to fit on one node" problem. At 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes, in-memory systems can be used.
• Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mindset. When latency is important, it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access; some are more about reliability, for example. But what people have wanted for a long time was a better memcached, and many NoSQL systems offer that.
• Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly compatible datatypes like JSON.
• Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
• Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.
• Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.
• No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.
• Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
• Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue. If it costs a lot to scale a product then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?
• Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and solution.
• Avoid hitting the wall. Many projects hit some type of wall in their project. They've exhausted all options to make their system scale or perform properly and are wondering what next? It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources. At one time this wasn't possible. It took custom built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.
• Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focussed on scale, tend to exploit partitions, tend not to use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.
• Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency which means they can't tolerate a partition failure. In the end, this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few drops OK? Does your app need strong or weak consistency? Is availability more important or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.
More Specific Use Cases
• Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
• Syncing online and offline data. This is a niche CouchDB has targeted.
• Fast response times under all loads.
• Avoiding heavy joins for when the query load for complex joins becomes too large for an RDBMS.
• Soft real-time systems where low latency is critical. Games are one example.
• Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads 50% writes, 95% writes, or 95% reads. Read-only applications needing extreme speed and resiliency, simple queries, and that can tolerate slightly stale data. Applications requiring moderate performance, read/write access, simple queries, completely authoritative data. A read-only application with complex query requirements.
• Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
• Real-time inserts, updates, and queries.
• Hierarchical data like threaded discussions and parts explosion.
• Dynamic table creation.
• Two-tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
• Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
• Slicing off part of service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
• Caching. A high performance caching tier for websites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.
• Voting.
• Real-time page view counters.
• User registration, profile, and session data.
• Document, catalog management and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
• Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.
• Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
• Working with heterogeneous types of data, for example, different media types at a generic level.
• Embedded systems. They don’t want the overhead of SQL and servers, so they use something simpler for storage.
• A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.
• JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3. (source)
• Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.
• Fraud detection by comparing transactions to known patterns in real-time.
• Helping diagnose the typology of tumors by integrating the history of every patient.
• In-memory database for high-update situations, like a website that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 sec, then you will pretty much be at your limit with about 5000 simultaneous users.
• Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.
• Priority queues.
• Running calculations on cached data, using a program friendly interface, without having to go through an ORM.
• Uniq a large dataset using simple key-value columns.
• To keep querying fast, values can be rolled-up into different time slices.
• Computing the intersection of two massive sets, where a join would be too slow.
• A timeline ala Twitter.
Redis use cases, VoltDB use cases and more can be found here.
This question is almost impossible to answer because of its generality. I think you are looking for some sort of easy "problem = solution" answer. The problem is that each "problem" becomes more and more unique as it becomes a business.
What do you call a social network? Twitter? Facebook? LinkedIn? Stack Overflow? They all use different solutions for different parts, and many solutions can exist that use a polyglot approach. Twitter has a graph-like concept, but there are only first-degree connections: followers and following. LinkedIn, on the other hand, thrives on showing how people are connected beyond the first degree. These are two different processing and data needs, but both are "social networks".
If you have a "social network" but don't do any discovery mechanisms, then you can easily use any basic key-value store most likely. If you need high performance, horizontal scale, and will have secondary indexes or full-text search, you could use Couchbase.
If you are doing machine learning on top of the log data you are gathering, you can integrate Hadoop with Hive or Pig, or Spark/Shark. Or you can do a lambda architecture and use many different systems with Storm.
If you are doing discovery via graph-like queries that go beyond second-degree vertices and also filter on edge properties, you will likely consider graph databases on top of your primary store. However, graph databases aren't good choices for a session store, or as general-purpose stores, so you will need a polyglot solution to be efficient.
What is the data velocity? The scale? How do you want to manage it? What expertise do you have available in the company or startup? There are a number of reasons this is not a simple question to answer.
A short, useful read specific to database selection: How to choose a NoSQL Database?. I will highlight key points in this answer.
Key-Value vs Document-oriented
Key-value stores
If you have a clear data structure defined such that all the data would have exactly one key, go for a key-value store. It's like you have a big Hashtable, and people mostly use it for cache stores or clearly key-based data. However, things start getting a little nasty when you need to query the same data on the basis of multiple keys!
Some key value stores are: memcached, Redis, Aerospike.
Two important things about designing your data model around key-value store are:
You need to know all your use cases in advance, and you cannot change the queryable fields in your data without a redesign.
Remember, if you are going to maintain multiple keys around the same data in a key-value store, updates to multiple tables/buckets/collections/whatever are NOT atomic. You need to deal with this yourself.
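A minimal sketch of that caveat with redis-py (Redis server on localhost assumed; the key names are illustrative): the same record is kept under two keys, and nothing in a generic key-value store makes the pair of writes atomic, though Redis specifically offers pipelines/transactions.

import redis  # pip install redis; assumes Redis on localhost

r = redis.Redis()
record = '{"id": 42, "email": "a@example.com"}'

# The same user record is kept under two keys so it can be looked up
# by id or by email.
r.set("user:id:42", record)
# If the process dies here, the two views of the data disagree, and a
# plain key-value store will not roll the first write back.
r.set("user:email:a@example.com", record)

# Redis specifically can batch the two writes atomically; a generic
# key-value store may offer no such facility.
with r.pipeline(transaction=True) as pipe:
    pipe.set("user:id:42", record)
    pipe.set("user:email:a@example.com", record)
    pipe.execute()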
Document-oriented
If you are just moving away from an RDBMS and want to keep your data in an object-like way, as close to a table-like structure as possible, a document structure is the way to go! It is particularly useful when you are creating an app and don't want to deal with RDBMS table design early on (in the prototyping stage) and your schema could change drastically over time. However, note:
Secondary indexes may not perform as well.
Transactions are not available.
Popular document-oriented databases are: MongoDB, Couchbase.
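As a quick sketch of the schema-flexibility point with pymongo (names are illustrative, local MongoDB assumed): documents of different shapes can coexist in one collection, so the schema can drift during prototyping without migrations, at the cost of application code having to handle missing fields.

from pymongo import MongoClient  # pip install pymongo

users = MongoClient().app.users  # illustrative names, local server

# Early prototype: a minimal shape.
users.insert_one({"name": "Ann"})

# Later iteration: new fields appear -- no ALTER TABLE, no migration.
users.insert_one({"name": "Bob", "tags": ["admin"], "age": 34})

# Old and new shapes live side by side; absent fields are simply
# missing, and application code has to be prepared for that.
for u in users.find({}):
    print(u["name"], u.get("tags", []))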
Comparing Key-value NoSQL databases
memcached
In-memory cache
No persistence
TTL supported
client-side clustering only (client stores value at multiple nodes). Horizontally scalable through client.
Not good for large-size values/documents
Redis
In-memory cache
Disk supported – backup and rebuild from disk
TTL supported
Super-fast (see benchmarks)
Data structure support in addition to key-value
Clustering support not mature enough yet. Vertically scalable (see Redis Cluster specification)
Horizontal scaling could be tricky.
Supports Secondary indexes
Aerospike
Both in-memory & on-disk
Extremely fast (could support >1 Million TPS on a single node)
Horizontally scalable. Server side clustering. Sharded & replicated data
Automatic failovers
Supports Secondary indexes
CAS (safe read-modify-write) operations, TTL support
Enterprise class
Comparing document-oriented NoSQL databases
MongoDB
Fast
Mature & stable – feature rich
Supports failovers
Horizontally scalable reads – read from replica/secondary
Writes not scalable horizontally unless you use mongo shards
Supports advanced querying
Supports multiple secondary indexes
Shard architecture becomes tricky, and is not scalable beyond a point where you need secondary indexes. An elementary shard deployment needs 9 nodes at minimum.
Document-level locks are a problem if you have a very high write-rate
Couchbase Server
Fast
Sharded cluster instead of MongoDB's master-slave setup
Hot failover support
Horizontally scalable
Supports secondary indexes through views
Learning curve bigger than MongoDB
Claims to be faster

Google's Bigtable vs. A Relational Database [duplicate]

This question already has answers here:
Closed 13 years ago.
Duplicates
Why should I use document based database instead of relational database?
Pros/Cons of document based database vs relational database
I don't know much about Google's Bigtable but am wondering what the difference between Google's Bigtable and relational databases like MySQL is. What are the limitations of both?
Bigtable is Google's invention to deal with the massive amounts of information that the company regularly deals in. A Bigtable dataset can grow to immense size (many petabytes) with storage distributed across a large number of servers. The systems using Bigtable include projects like Google's web index and Google Earth.
According to Google's whitepaper on the subject:
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
The internal mechanics of Bigtable versus, say, MySQL are so dissimilar as to make comparison difficult, and the intended goals don't overlap much either. But you can think of Bigtable a bit like a single-table database. Imagine, for example, the difficulties you would run into if you tried to implement Google's entire web search system with a MySQL database -- Bigtable was built around solving those problems.
Bigtable datasets can be queried from services like AppEngine using a language called GQL ("gee-kwal"), which is based on a subset of SQL. Conspicuously missing from GQL is any sort of JOIN command. Because of the distributed nature of a Bigtable database, performing a join between two tables would be terribly inefficient. Instead, the programmer has to implement such logic in his application, or design his application so as not to need it.
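To illustrate what "implement such logic in his application" means, here is a hedged sketch of an application-side join in plain Python; the two lists stand in for the result sets of two separate datastore queries, since no server-side JOIN is available.

# Two result sets, as they might come back from two separate datastore
# queries -- no JOIN is available on the server side.
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [
    {"customer_id": 1, "total": 30},
    {"customer_id": 1, "total": 12},
    {"customer_id": 2, "total": 7},
]

# The application performs the join itself: build a lookup table on
# the join key, then merge row by row.
by_id = {c["id"]: c for c in customers}
joined = [
    {"customer": by_id[o["customer_id"]]["name"], "total": o["total"]}
    for o in orders
]
print(joined)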
Google's BigTable and other similar projects (e.g. CouchDB, HBase) are database systems that are oriented so that data is mostly denormalized (i.e., duplicated and grouped).
The main advantages are:
- Join operations are less costly because of the denormalization
- Replication/distribution of data is less costly because of data independence (ie, if you want to distribute data across two nodes, you probably won't have the problem of having an entity in one node and other related entity in another node because similar data is grouped)
These kinds of systems are indicated for applications that need to achieve optimal scale (i.e., you add more nodes to the system and performance increases proportionally). In an RDBMS like MySQL or Oracle, once you start adding more nodes, joining two tables that are not on the same node becomes costly. This becomes important when you are dealing with high volumes.
RDBMSs are nice because of the richness of the storage model (tables, joins, FKs). Distributed databases are nice because of the ease of scale.

What is an example of a non-relational database? Where/how are they used?

I have been working with relational databases for some time, but it only recently occurred to me that there must be other types of databases that are non-relational.
What are some examples of non-relational databases, and where/how are they used in the real world? Why would you choose to use a non-relational database over relational databases?
Edit: Two other similar questions have been mentioned in the answers:
Database system that is not relational.
Good reasons NOT to use a relational database?
An admittedly obscure but interesting alternative to the types of databases mentioned here is the associative database, such as Sentences, from LazySoft Technology. There is a free personal version you can download and try on your own. The Enterprise Edition is also free, but requires a request to the company.
Essentially, an associative database allows you to store information in much the same way as our brains do: as things and associations between those things. The name "Sentences" comes from the way this information can be represented in a subject-verb-object syntax:
Tom is brother to Laura
San Francisco is located in California
Mike has a credit limit of $10,000
A sentence may be the subject or object of another sentence:
(Bus 570 arrives at 8:15am) on Sundays
Mary says (the pie was baked by William)
So, everything can be boiled down to entities and associations.
There is, of course, much more to Sentences than what can be expressed here. I recommend that you take some time to read more about it in a white paper from LazySoft.
"The Associative Model of Data" is a book available in PDF format by Simon Williams, one of the creators of Sentences.
Flat file
CSV or other delimited data
spreadsheets
/etc/passwd
mbox mail files
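To make the flat-file point concrete, /etc/passwd really is a small queryable database: colon-delimited, seven fields per row. A minimal sketch (Unix-like system assumed):

import csv

# /etc/passwd: seven colon-delimited fields per row --
# name, password, uid, gid, gecos, home, shell.
with open("/etc/passwd", newline="") as f:
    for row in csv.reader(f, delimiter=":"):
        if len(row) != 7:  # skip comment/blank lines on some systems
            continue
        name, _pw, uid, _gid, _gecos, home, shell = row
        if shell == "/bin/bash":  # a "query": accounts using bash
            print(name, uid, home)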
Hierarchical
Windows Registry
Subversion using the file system, FSFS, instead of Berkeley DB
A non-relational document oriented database we have been looking at is Apache CouchDB.
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.
Our interest was in providing a distributed-access user preferences store that would be immune to shape changes, to which we could serialize preference objects from Java and access those just as easily with JavaScript from a XULRunner-based client application.
Any database that claims to be a "Berkeley-style database" or "key/value" database is not relational.
These databases are usually based on complex hashing algorithms that provide a very fast O(1) lookup by key, but leave any form of relational goodness to the end user.
For example, in a relational database, you would normalize your structure and join many tables together to create a single result set.
In a key/value database, you would denormalize as much as possible and then use a unique key to lookup data.
If you need to pull data from two sources, you would have to join the resulting set together by hand.
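A minimal sketch of that denormalize-and-look-up-by-key pattern, using Python's standard-library shelve module (a dbm-backed, Berkeley-style key/value store); the record layout and key names are illustrative:

import shelve

# Denormalize: everything about an order lives under one key, so a
# single hash-style lookup returns the whole record.
with shelve.open("/tmp/orders_kv") as db:
    db["order:1001"] = {
        "customer": {"id": 42, "name": "Ann"},  # embedded, not joined
        "items": [{"sku": "A1", "qty": 2}],
    }

    # Lookup by the unique key -- no query planner, no joins. Pulling
    # from two "sources" would mean two lookups plus a hand-rolled
    # merge in application code.
    print(db["order:1001"]["customer"]["name"])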
All databases were originally non-relational; it was only with the arrival of DB2 and Oracle in the mid-1980s that relational ones became common. Before that, most databases were either flat files or hierarchical.
Flat files are inherently boring, but hierarchical databases are much less so, particularly as DB2 was actually implemented on top of a hierarchical implementation (namely VSAM) in the first instance. VSAM is, I believe, still around on mainframe systems and is of some considerable importance.
DB/1 (so obscure now I can't even find a Wikipedia link) was IBM's prime-time predecessor database to DB2 (hence the name). This was hierarchical - basically you had a file which consisted of any number of 'root' records, generally directly accessible by a key. Each root record could then have any number of child records off it, each of which could in turn have its own children. The net effect is an index file of root records, with each root being the top of a potential tree-like structure. Accessing the child records could be tricky - there were limitations on direct access, so usually you ended up traversing the tree looking for the record you needed. A 'database' could have any number of these files in it, usually related by keys.
This had major disadvantages - not least that actually doing anything demanded a full program be written - basically the equivalent of a day's work for what we can now do in SQL in a few minutes. However, it really did score on execution speed; in those days a mainframe had about the processing power of your iPhone (albeit optimized for data I/O), and poor DB2 queries could kill a multi-million dollar installation dead. This was never an issue with DB/1, and in a world where programmers were less expensive than CPU time it made sense.
Google App Engine Datastore :
The App Engine datastore is not a relational database. While the datastore interface has many of the same features of traditional databases, the datastore's unique characteristics imply a different way of designing and managing data to take advantage of the ability to scale automatically.
The PI historical database from OSIsoft is non-relational. It's only made to archive timestamped data. It's used a lot by industry, especially as the back-end database for all those 'dashboards'.
There's no need to be relational in it, since there are no joins.
Other two types of databases that haven't come up yet:
Content Repositories are databases designed for content (i.e. files, documents, images, etc.). They typically have additional constructs such as a hierarchical way to browse content, search, transformation between different formats, versioning, and many other things. Examples: Alfresco, Documentum, JackRabbit, Day, OpenText, and many other ECM vendors.
Directories, i.e. Active Directory or LDAP directories. These are databases designed for low-write/high-read scenarios, highly distributed across large geographical distances and high-latency connections. While mostly used for authentication/authorization, they don't have to be if your use case matches their requirements.
Dimensional Databases are great examples of non-relational databases. They are very commonly used for 'Business Dashboards'/'Business Intelligence' for KPIs and other types of aggregate or statistical data. They are usually populated from relational databases and can offer better performance in certain situations.
http://en.wikipedia.org/wiki/Dimensional_database
XML databases, e.g. Xindice
Object databases, e.g. db4o
Be aware that the concept of relational databases is highly contentious. Purists such as C. J. Date would argue that many databases in common use (such as Oracle and SQL Server) do not comply sufficiently with the relational model to be termed 'relational'.
Non-Relational databases just do not meet Codd’s requirements.
InterSystems Caché seems to be a total rewrite/redesign of the old Pick operating system's database. From the little I've read of Caché, it appears to be a nicely done redesign.
It permits .NET programs to access the database just as they would SQL. Caché runs the Pick OS programs without requiring any changes. By importing your Pick files into Caché you can still run your old green-screen applications with it, but also write new programs using .NET, so you can migrate to Windows applications without abandoning the years of data design you've already invested in.
Here is some background on the Pick DB model. A Pick database uses totally variable-length records and fields. All tables are keyed by a single unique key and are accessible without reading an index. Pick designed the system to use a hashing algorithm that reads an item from disk generally on the first physical read (assuming system maintenance was performed correctly). Fields in Pick are non-typed: all data is stored as strings, and casting is up to the programmer. Nulls are stored as an empty string, so a null does not take up disk space as it does in SQL. There is no need for foreign keys.
In the relational world, the DBA has to create an Order Header table and an Order Line Item table; in the Pick model there is a single table. For example, 'Order Date' is a field that would store a number of days since Dec 13, 1967 (the date the Pick OS was turned on for the first time) - Pick programmers did not have Y2K problems. A second column would be Customer Number. The big difference comes at the Product Number column: it would be 'multi-valued' (the Codd non-conformance). In other words, the database can handle 1-32000 product numbers in that column. Other columns, like Quantity Ordered, would be in a controlling/dependent relationship with Product Number and would also be multi-valued. For Quantity Shipped, Pick would go to a third dimension and use a sub-multi-valued field: there would be a Shipment Number column, multi-valued by line item and sub-multi-valued, containing the shipment quantity for that line for that shipment number. No inner joins are needed; all data for that order is stored in one table and in a single record. No orphan rows, ever!
Secondly, the data definition is a bit different. The dictionaries can contain definitions for data that is not in the table or that is derived. A couple of examples: Customer Name would be defined as 'use the Customer Number column and return the Name field from the Customer table'; Line Item Extension would be defined as the calculation Quantity*Price/PricePer.
I believe I read somewhere Caché claims to have over 100,000 installations.
I would think a flat-file database in Excel is non-relational and used by quite a few people.
It is really just a database table that cannot be joined with other tables.
Object-Oriented Databases are one interesting type of non-relational database.
The trading sector sometimes uses OO Databases since each deal/contract can look sort of like others in that category but have unique attributes as well. VERY difficult to represent it relationally.
eXist-db is an XML database that has been around for a long time. It is particularly useful for XQuery over tons of XML documents.
Any file or group of files that contains data but does not express relationships within that data is a non-relational database.
RRDtool is designed to store and aggregate log data. You configure a sampling interval and feed data into it, then it returns time-based results. It's optimized for fixed-size storage, and starts aggregating past results after a time. For example, suppose you have a round-robin database with a 5-minute time interval. Even if you send it temperature data once per second, it still only stores the results in 5-minute increments. After a week, it averages those results into hourly values. After a month, hourly results are averaged into daily numbers, and so on.
RRDtool is commonly used as the backend for tools like Cricket and MRTG to track network and environmental data for months and years at a stretch.
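A from-scratch sketch of the round-robin idea (not RRDtool's actual implementation; the slot counts are illustrative): storage stays fixed-size, and once the raw ring fills, its contents survive only as a consolidated average in the next, coarser ring.

from collections import deque

class RoundRobinArchive:
    """Fixed-size archive: recent raw samples at full resolution,
    older data kept only as consolidated (averaged) values."""

    def __init__(self, raw_slots=12, hourly_slots=24):
        self.raw = deque(maxlen=raw_slots)        # e.g. 5-min samples
        self.hourly = deque(maxlen=hourly_slots)  # consolidated averages

    def add(self, value):
        self.raw.append(value)
        # Once the raw ring is full, fold it into one averaged slot
        # and reuse the ring; total storage never grows.
        if len(self.raw) == self.raw.maxlen:
            self.hourly.append(sum(self.raw) / len(self.raw))
            self.raw.clear()

rr = RoundRobinArchive()
for t in range(36):        # three "hours" of 5-minute samples
    rr.add(20.0 + t % 5)   # fake temperature readings
print(list(rr.hourly))     # three averaged slots survive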
For a graph-based DBMS you have Neo4j.
For a hierarchical DBMS you have any standard filesystem, or, with "schema" support, any LDAP implementation.
There are many answers but they all end up being in one of two major categories:
Navigational. Includes Tree/Hierarchy databases and Graph databases.
Databases that break first normal form (multiple values). Includes Pick databases and Lotus Notes and its offspring like CouchDB.
EDIT: And of course key/value stores like BDB aren't relational, but that just goes without saying doesn't it? I mean, they're just key/value stores.
dBase. Although it was marketed as relational, it doesn't meet the requirements.
As an OO database, InterSystems Caché comes to mind. Some medical and library systems are built on this.
In my company, www.smartsgroup.com, we have a proprietary database engine we call a "transaction log database". It is built on flat files, each file containing a sequence of "events" or "messages", in binary format, plus various indexes on this data and algorithms for reproducing the state of a stock exchange's orderbook. It is highly optimised for sequential updates and sequential access.
In scientific applications, it is also common to use proprietary database engines rather than RDBMSs. I also worked for a company that has the world's largest database of EEG brain recordings: www.brainresource.com. There we used a flat-file database, and it worked well for us.
SmartsGroup also uses a temporal database, which is like a non-relational database table, except that we store a history of all changes to all fields so we can reproduce the state of a particular row on a particular date.
The Wiki page for Dimensional Databases linked to above seems to have disappeared.
Some OLAP systems are backed by multidimensional databases (MOLAP); these are often used in financial analysis. They afford interactive clients that allow one to navigate through the data at different levels of aggregation.
At my university there is a group that researches deductive databases.
