NimbusDB - distributed, non-blocking, atomic commit protocol? - database

From the NimbusDB website:
Our distributed non-blocking atomic commit protocol allows database transaction processing at any available node.
They claim that they can guarantee ACID transactions in a distributed environment, and provide all of: consistency, high availability and partition tolerance. As far as I can tell from the text, their "secret" for overcoming the limitations of CAP theorem is some sort of "predictable and consistent" way to manage network partitions.
I'm wondering if anyone has some insights or more information on what's behind this?

There are multiple possible meanings for the word "consistency". See, e.g., Why is C in CAP theorem not same as C in ACID?
Plus, some level of debate is also possible as to the meaning of the C in 'ACID': while it is typically defined in a sense that relates to database integrity ("no transaction shall get to see a database state that violates a declared constraint - modulo the inconsistencies that the transaction has created itself, of course"), one commenter said he interpreted it as referring to "the database state as seen (or perhaps better, as effectively used) by any transaction does not change while that transaction is in progress". Paraphrased: transactions are ACID-compliant if they are executing in at least repeatable read mode.
If you take the CAP-C to mean "all nodes see the same data at the same time", then availability is necessarily hampered because while the system is busy distributing the data to the various nodes, it cannot allow any transaction access to (the elder versions of) that data. (Unless of course access to elder versions is precisely what is needed, such as when a transaction is running under MVCC.)
If you take the CAP-C to mean something along the lines of "no transaction can get to see an inconsistent database state", then essentially the same applies, except that it is now the user's update process that should be locking out access for all other transactions.
If you impose a rule to the effect that "whenever a transaction has accessed a particular node N to read from some resource R (assuming R could theoretically be accessed on more than one node), then whenever that transaction accesses R again, it should do so on the same node N.", then I can imagine this will increase your guarantee of "consistency", but you pay in availability, because if node N falls out, then precisely because of the rule imposed, your transaction cannot access R anymore even if it could be done on other nodes.
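Below is a minimal sketch (plain Python; the class and its behaviour are my own illustration, not any vendor's protocol) of the node-pinning rule described above. It shows where the availability cost comes from: once a transaction has read resource R from node N it must keep using N, so losing N blocks that transaction even while other replicas of R remain reachable.

    class NodeDownError(Exception):
        pass

    class PinnedRouter:
        """Routes a transaction's reads for a resource to the node it first used."""

        def __init__(self, replicas_by_resource):
            self.replicas = replicas_by_resource   # e.g. {"R": ["node1", "node2"]}
            self.pins = {}                         # (txn_id, resource) -> pinned node
            self.down = set()                      # nodes currently unreachable

        def read(self, txn_id, resource):
            node = self.pins.get((txn_id, resource))
            if node is None:
                # First access: pick any live replica and pin it for this transaction.
                live = [n for n in self.replicas[resource] if n not in self.down]
                if not live:
                    raise NodeDownError(resource)
                node = live[0]
                self.pins[(txn_id, resource)] = node
            elif node in self.down:
                # The rule forbids failing over, so the transaction loses access
                # to R even though another replica is still alive.
                raise NodeDownError(node)
            return f"value of {resource} read from {node}"

    router = PinnedRouter({"R": ["node1", "node2"]})
    print(router.read("txn-1", "R"))   # pins txn-1 to node1
    router.down.add("node1")
    # router.read("txn-1", "R")        # would now raise, despite node2 being up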
At any rate, I think that if an institution such as Berkeley comes up with a proof of some theorem, then you're on the safe side if you consider vociferous claims such as the one you mention to be marketing lies.

It's been a while since this post was written and since then NuoDB has added a lot to their product marketing and technical resources on their website.
They've achieved data durability and ACID compliance by using their Durable Distributed Cache system. They now call it an "Emergent Architecture" (pp. 6-7):
The architecture opens a variety of possible future directions including “time-travel”, the ability to create a copy of the database that recreates its state at an earlier time; “cloud bursting”, the ability to move a database across cloud systems managed by separate groups; and
“coteries”, a mechanism that addresses the CAP Theorem by allowing the DBA to specify which systems survive a network partition to provide consistency and partition resistance with continuous availability.
From the How It Works page:
Today’s database vendors have applied three common design patterns around traditional systems to extend them into distributed scale-out database systems. These approaches – Shared-Disk, Shared-Nothing and Synchronous Commit - overcome some of the limitations of single-server deployments, but remain complex and prone to error.
By stepping back and rethinking database design from the ground up, Jim Starkey, NuoDB’s technical founder, has come up with an entirely new design approach called Durable Distributed Cache (DDC). The net effect is a system that scales-out/in dynamically on commodity machines and virtual machines, has no single point of failure, and delivers full ACID transactional semantics.
The primary architectural difference between NuoDB's NewSQL model and that of more traditional RDBMS systems is that NuoDB inverts the traditional relationship between memory and storage, creating an ACID-compliant RDBMS with an underlying design similar to that of a distributed DRAM cache. From the NuoDB Durable Distributed Cache page:
All general-purpose relational databases to date have been architected around a storage-centric assumption. Unfortunately this creates a fundamental problem relative to scaling out. In effect, these database systems are fancy file systems that arrange for concurrent read/write access to disk-based files such that users do not interfere with each other.
The NuoDB DDC architecture inverts this idea, imagining the database as a set of in-memory container objects that can overflow to disk if necessary and can be retained in backing stores for durability purposes.
All servers in the NuoDB DDC architecture can request and supply objects (referred to as Atoms) thereby acting as peers to each other. Some servers have a subset of the objects at any given time, and can therefore only supply a subset of the database to other servers. Other servers have all the objects and can supply any of them, but will be slower to supply objects that are not resident in memory.
NuoDB consists of two types of servers: Transaction Engines (TEs) hold a subset of the objects; Storage Managers (SMs) are servers that have a complete copy of all objects. TEs are pure in-memory servers that do not use disks. They are autonomous and can unilaterally load and eject objects from memory according to their needs. Unlike TEs, SMs can't just drop objects on the floor when they are finished with them; instead they must ensure that they are safely placed in durable storage.
For those familiar with caching architectures, you might have already recognized that these TEs are in effect a distributed DRAM cache, and the SMs are specialized TEs that ensure durability. Hence the name Durable Distributed Cache.
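For intuition only, here is a toy sketch in Python of the durable-distributed-cache idea described above (this is my own simplification, not NuoDB's code or API): a transaction-engine-like peer keeps a bounded in-memory map of atoms and may evict freely, while a storage-manager-like peer must persist every atom to a backing store before acknowledging a write.

    import json, os

    class TransactionEngine:
        """In-memory peer: may load and evict atoms at will, no durability duty."""
        def __init__(self, capacity=2):
            self.capacity = capacity
            self.atoms = {}                              # atom_id -> value, RAM only

        def put(self, atom_id, value):
            if len(self.atoms) >= self.capacity:
                self.atoms.pop(next(iter(self.atoms)))   # evict an arbitrary atom
            self.atoms[atom_id] = value

        def get(self, atom_id, storage_manager):
            if atom_id not in self.atoms:                # cache miss: ask an SM peer
                self.put(atom_id, storage_manager.get(atom_id))
            return self.atoms[atom_id]

    class StorageManager:
        """Peer that holds every atom and writes it to durable storage."""
        def __init__(self, path="atoms.json"):
            self.path = path
            self.atoms = json.load(open(path)) if os.path.exists(path) else {}

        def put(self, atom_id, value):
            self.atoms[atom_id] = value
            with open(self.path, "w") as f:              # durable write before ack
                json.dump(self.atoms, f)

        def get(self, atom_id):
            return self.atoms[atom_id]

    sm = StorageManager()
    te = TransactionEngine()
    sm.put("a1", "row data")
    print(te.get("a1", sm))                              # miss, fetched from the SM

The real system adds replication, multi-version concurrency control and a commit protocol on top; the sketch only shows the memory-first, storage-behind inversion.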
They also publish a technical white paper that deep-dives into the sub-system components and the way they work together to provide an ACID-compliant RDBMS with most of the performance of a NoSQL system (NOTE: registration on their site is required to download the white paper). The general gist is that they provide an automated network cluster partitioning system that, when combined with their persistent storage system, addresses the concerns raised by the CAP Theorem.
There are also a lot of informative technical white papers and independent analysis reports on their technology in their Online Documents Library.

Related

Is it possible to create a relational data schema with multiple tables on blockchain?

Blockchain has been described as a form of database. So far, most blockchain applications seem to involve blockchain as a one-table database.
Is it possible to create a data schema with multiple tables, one-to-many relationship like a relational database on blockchain? If not, why?
If I understand your question, you're asking, "can I treat a blockchain just like a normal relational database, and execute queries?" I assume you basically want a database, with the benefits of "immutability" and decentralization. Technically, the answer is yes, but economically and practically, there are a few things to consider.
Immutability of blockchain is only really achieved if there really is decentralization and a strong decentralized consensus mechanism. For example, Bitcoin has been immutable because there is a lot of distributed hash power around the world, and a 51% attack would be prohibitively expensive. Smaller networks like Bitcoin Gold, can (and were) attacked because they don't have enough hash power to resist an attacker. So you need to make sure you are resistant to this.
If you want to use the blockchain as a database for arbitrary storage, there has to be an economic incentive for users to store strangers' data. Many notable blockchain projects have attempted this without as much adoption as previously hoped. This is in part because the economic incentive is not great enough. The only incentive Bitcoin users (miners) have to store the block data is to make sure the next mined block will be valid (if they don't store the data, they might mine an invalid block unknowingly because they can't properly validate the transactions).
The technical implementation of this depends on what you are trying to accomplish. If you want to be able to query the blockchain like a database (without necessarily being able to store arbitrary data) these solutions already exist. Different node implementations can use whatever language, storage mechanism, operating system they want, as long as they can communicate and follow the consensus rules. The block data can be stored in a flat file, SQL, noSQL, whateverSQL, the node application just has to be able to read, write, and validate the data. The thing to optimize for here is speed. Since full tx validation requires the entire blockchain, this can be very slow if the database lookups are slow.
Blockchain was invented as a solution to decentralized consensus, and cannot really be taken out of the context it was invented in, which requires not only a technical implementation, but an economic incentive for usage.
Using the blockchain structure directly? No.
However, a blockchain ledger serves a very similar purpose to a write-ahead log (WAL). Quoting from Wikipedia: "a family of techniques for providing atomicity and durability (two of the ACID properties) in database systems".
WALs and distributed ledgers are just ways of registering a sequential set of events, where order matters. The main difference is that a WAL does not contain the whole history of events.
The way you construct SQL (or any other type of) database over a blockchain is by using the ledger as a WAL for SQL instructions. Since the ledger contains the whole history of events, you can always reconstruct the SQL database by executing the history in the same order.
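A small sketch of that idea, using Python's standard sqlite3 module (the ledger format here is made up for illustration): if every validated block carries SQL statements in a fixed order, any node can rebuild the relational state by replaying the chain from the beginning, exactly as a database replays its WAL during recovery.

    import sqlite3

    # A hypothetical, already-validated ledger: ordered blocks of SQL statements.
    ledger = [
        {"height": 1, "statements": [
            "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)"]},
        {"height": 2, "statements": [
            "INSERT INTO accounts VALUES (1, 100)",
            "INSERT INTO accounts VALUES (2, 50)"]},
        {"height": 3, "statements": [
            "UPDATE accounts SET balance = balance - 10 WHERE id = 1",
            "UPDATE accounts SET balance = balance + 10 WHERE id = 2"]},
    ]

    def rebuild(ledger):
        """Replay every block in order, like recovering a database from its WAL."""
        db = sqlite3.connect(":memory:")
        for block in sorted(ledger, key=lambda b: b["height"]):
            for stmt in block["statements"]:
                db.execute(stmt)
        db.commit()
        return db

    db = rebuild(ledger)
    print(db.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
    # [(1, 90), (2, 60)]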

Practical example for each type of database (real cases) [closed]

There are several types of database for different purposes, but MySQL is normally used for everything, simply because it is the best-known database. To give an example, in my company a big-data application started out on a MySQL database, which is unbelievable and will bring serious consequences for the company. Why MySQL? Just because no one knows how (and when) another DBMS should be used.
So, my question is not about vendors, but about types of databases. Can you give me a practical example of specific situations (or apps) for each type of database where it is highly recommended?
Example:
• A social network should use the type X because of Y.
• MongoDB or CouchDB can't support transactions, so a document DB is not good for an app for a bank or auction site.
And so on...
Relational: MySQL, PostgreSQL, SQLite, Firebird, MariaDB, Oracle DB, SQL server, IBM DB2, IBM Informix, Teradata
Object: ZODB, DB4O, Eloquera, Versant , Objectivity DB, VelocityDB
Graph databases: AllegroGraph, Neo4j, OrientDB, InfiniteGraph, graphbase, sparkledb, flockdb, BrightstarDB
Key value-stores: Amazon DynamoDB, Redis, Riak, Voldemort, FoundationDB, leveldb, BangDB, KAI, hamsterdb, Tarantool, Maxtable, HyperDex, Genomu, Memcachedb
Column family: Big table, Hbase, hyper table, Cassandra, Apache Accumulo
RDF Stores: Apache Jena, Sesame
Multimodel Databases: arangodb, Datomic, Orient DB, FatDB, AlchemyDB
Document: Mongo DB, Couch DB, Rethink DB, Raven DB, terrastore, Jas DB, Raptor DB, djon DB, EJDB, denso DB, Couchbase
XML Databases: BaseX, Sedna, eXist
Hierarchical: InterSystems Caché, GT.M (thanks to Laurent Parenteau)
I found two impressive articles about this subject. All credits to highscalability.com. The information in this answer is transcribed from these articles:
35+ Use Cases For Choosing Your Next NoSQL Database
What The Heck Are You Actually Using NoSQL For?
If Your Application Needs...
• complex transactions because you can't afford to lose data or if you would like a simple transaction programming model then look at a Relational or Grid database.
• Example: an inventory system that might want full ACID. I was very unhappy when I bought a product and they said later they were out of stock. I did not want a compensated transaction. I wanted my item!
• to scale then NoSQL or SQL can work. Look for systems that support scale-out, partitioning, live addition and removal of machines, load balancing, automatic sharding and rebalancing, and fault tolerance.
• to always be able to write to a database because you need high availability then look at Bigtable Clones which feature eventual consistency.
• to handle lots of small continuous reads and writes, that may be volatile, then look at Document or Key-value or databases offering fast in-memory access. Also, consider SSD.
• to implement social network operations then you first may want a Graph database or second, a database like Riak that supports relationships. An in-memory relational database with simple SQL joins might suffice for small data sets. Redis' set and list operations could work too.
• to operate over a wide variety of access patterns and data types then look at a Document database, they generally are flexible and perform well.
• powerful offline reporting with large datasets then look at Hadoop first and second, products that support MapReduce. Supporting MapReduce isn't the same as being good at it.
• to span multiple data-centers then look at Bigtable Clones and other products that offer a distributed option that can handle the long latencies and are partition tolerant.
• to build CRUD apps then look at a Document database, they make it easy to access complex data without joins.
• built-in search then look at Riak.
• to operate on data structures like lists, sets, queues, publish-subscribe then look at Redis. Useful for distributed locking, capped logs, and a lot more.
• programmer friendliness in the form of programmer-friendly data types like JSON, HTTP, REST, Javascript then first look at Document databases and then Key-value Databases.
• transactions combined with materialized views for real-time data feeds then look at VoltDB. Great for data-rollups and time windowing.
• enterprise-level support and SLAs then look for a product that makes a point of catering to that market. Membase is an example.
• to log continuous streams of data that may have no consistency guarantees necessary at all then look at Bigtable Clones because they generally work on distributed file systems that can handle a lot of writes.
• to be as simple as possible to operate then look for a hosted or PaaS solution because they will do all the work for you.
• to be sold to enterprise customers then consider a Relational Database because they are used to relational technology.
• to dynamically build relationships between objects that have dynamic properties then consider a Graph Database because often they will not require a schema and models can be built incrementally through programming.
• to support large media then look at storage services like S3. NoSQL systems tend not to handle large BLOBS, though MongoDB has a file service.
• to bulk upload lots of data quickly and efficiently then look for a product that supports that scenario. Most will not because they don't support bulk operations.
• an easier upgrade path then use a fluid schema system like a Document Database or a Key-value Database because it supports optional fields, adding fields, and field deletions without the need to build an entire schema migration framework.
• to implement integrity constraints then pick a database that supports SQL DDL, implement them in stored procedures, or implement them in application code.
• a very deep join depth then use a Graph Database because they support blisteringly fast navigation between entities.
• to move behavior close to the data so the data doesn't have to be moved over the network then look at stored procedures of one kind or another. These can be found in Relational, Grid, Document, and even Key-value databases.
• to cache or store BLOB data then look at a Key-value store. Caching can be for bits of web pages, or for saving complex objects that were expensive to join in a relational database, to reduce latency, and so on.
• a proven track record like not corrupting data and just generally working then pick an established product and when you hit scaling (or other issues) use one of the common workarounds (scale-up, tuning, memcached, sharding, denormalization, etc).
• fluid data types because your data isn't tabular in nature, or requires a flexible number of columns, or has a complex structure, or varies by user (or whatever), then look at Document, Key-value, and Bigtable Clone databases. Each has a lot of flexibility in their data types.
• other business units to run quick relational queries so you don't have to reimplement everything then use a database that supports SQL.
• to operate in the cloud and automatically take full advantage of cloud features then we may not be there yet.
• support for secondary indexes so you can look up data by different keys then look at relational databases and Cassandra's new secondary index support.
• to create an ever-growing set of data (really BigData) that rarely gets accessed then look at a Bigtable Clone, which will spread the data over a distributed file system.
• to integrate with other services then check if the database provides some sort of write-behind syncing feature so you can capture database changes and feed them into other systems to ensure consistency.
• fault tolerance then check how durable writes are in the face of power failures, partitions, and other failure scenarios.
• to push the technological envelope in a direction nobody seems to be going then build it yourself because that's what it takes to be great sometimes.
• to work on a mobile platform then look at CouchDB/Mobile couchbase.
General Use Cases (NoSQL)
• Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space.
• Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month (in 2010). Twitter, for example, has the problem of storing 7 TB of data per day (in 2010) with the prospect of this requirement doubling multiple times per year. This is the "data is too big to fit on one node" problem. At 80 MB/s it takes roughly a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes, in-memory systems can be used.
• Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mind set. When latency is important it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access; some are more about reliability, for example. But what people have wanted for a long time was a better memcached, and many NoSQL systems offer that.
• Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly compatible datatypes like JSON.
• Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
• Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.
• Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.
• No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.
• Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
• Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue. If it costs a lot to scale a product then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?
• Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and solution.
• Avoid hitting the wall. Many projects hit some type of wall in their project. They've exhausted all options to make their system scale or perform properly and are wondering what next? It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources. At one time this wasn't possible. It took custom built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.
• Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focused on scale, tend to exploit partitions, tend not to use heavy, strict consistency protocols, and so are well positioned to operate in distributed scenarios.
• Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency which means they can't tolerate a partition failure. In the end, this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few drops OK? Does your app need strong or weak consistency? Is availability more important or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.
More Specific Use Cases
• Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
• Syncing online and offline data. This is a niche CouchDB has targeted.
• Fast response times under all loads.
• Avoiding heavy joins for when the query load for complex joins become too large for an RDBMS.
• Soft real-time systems where low latency is critical. Games are one example.
• Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads 50% writes, 95% writes, or 95% reads. Read-only applications needing extreme speed and resiliency, simple queries, and able to tolerate slightly stale data. Applications requiring moderate performance, read/write access, simple queries, completely authoritative data. A read-only application with complex query requirements.
• Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
• Real-time inserts, updates, and queries.
• Hierarchical data like threaded discussions and parts explosion.
• Dynamic table creation.
• Two-tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
• Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
• Slicing off part of service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
• Caching. A high performance caching tier for websites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.
• Voting.
• Real-time page view counters.
• User registration, profile, and session data.
• Document, catalog management and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
• Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.
• Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
• Working with heterogeneous types of data, for example, different media types at a generic level.
• Embedded systems. They don’t want the overhead of SQL and servers, so they use something simpler for storage.
• A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.
• JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3. (source)
• Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.
• Fraud detection by comparing transactions to known patterns in real-time.
• Helping diagnose the typology of tumors by integrating the history of every patient.
• In-memory database for high update situations, like a website that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 sec, then you will pretty much be at your limit with about 5000 simultaneous users.
• Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.
• Priority queues.
• Running calculations on cached data, using a program friendly interface, without having to go through an ORM.
• Uniquing (de-duplicating) a large dataset using simple key-value columns.
• To keep querying fast, values can be rolled-up into different time slices.
• Computing the intersection of two massive sets, where a join would be too slow.
• A timeline ala Twitter.
Redis use cases, VoltDB use cases and more can be found here.
This question is almost impossible to answer because of its generality. I think you are looking for some sort of easy "problem = solution" answer. The problem is that each "problem" becomes more and more unique as it becomes a business.
What do you call a social network? Twitter? Facebook? LinkedIn? Stack Overflow? They all use different solutions for different parts, and many solutions exist that use a polyglot approach. Twitter has a graph-like concept, but with only first-degree connections: followers and following. LinkedIn, on the other hand, thrives on showing how people are connected beyond the first degree. These are two different processing and data needs, but both are "social networks".
If you have a "social network" but don't do any discovery mechanisms, then you can most likely use any basic key-value store. If you need high performance, horizontal scale, and will have secondary indexes or full-text search, you could use Couchbase.
If you are doing machine learning on top of the log data you are gathering, you can integrate Hadoop with Hive or Pig, or Spark/Shark. Or you can do a lambda architecture and use many different systems with Storm.
If you are doing discovery via graph-like queries that go beyond second-degree vertices and also filter on edge properties, you will likely consider a graph database on top of your primary store. However, graph databases aren't good choices for a session store or as general-purpose stores, so you will need a polyglot solution to be efficient.
What is the data velocity? The scale? How do you want to manage it? What expertise do you have available in the company or startup? There are a number of reasons why this is not a simple question to answer.
A short useful read specific to database selection: How to choose a NoSQL Database?. I will highlight key points in this answer.
Key-Value vs Document-oriented
Key-value stores
If you have a clear data structure defined such that all the data has exactly one key, go for a key-value store. It's like having a big hashtable, and people mostly use it for cache stores or clearly key-based data. However, things start getting nasty when you need to query the same data on the basis of multiple keys!
Some key value stores are: memcached, Redis, Aerospike.
Two important things about designing your data model around key-value store are:
You need to know all use cases in advance, and you cannot change the queryable fields in your data without a redesign.
Remember, if you are going to maintain multiple keys around the same data in a key-value store, updates to multiple tables/buckets/collections/whatever are NOT atomic. You need to deal with this yourself (see the sketch below).
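To make the second point concrete, here is a sketch with plain Python dicts standing in for two buckets of a hypothetical key-value store (nothing vendor-specific): the two puts are separate operations, so a crash between them leaves the two lookup paths disagreeing, and the application has to detect and repair that itself.

    # Two "buckets" of a hypothetical key-value store, one per lookup key.
    users_by_id = {}
    users_by_email = {}

    def save_user(user_id, email, profile, crash_between_writes=False):
        users_by_id[user_id] = profile          # write 1
        if crash_between_writes:
            raise RuntimeError("process died between the two writes")
        users_by_email[email] = user_id         # write 2, not atomic with write 1

    save_user("u1", "a@example.com", {"name": "Ann"})
    try:
        save_user("u2", "b@example.com", {"name": "Bob"}, crash_between_writes=True)
    except RuntimeError:
        pass

    # The store is now inconsistent: u2 exists by id but is invisible by email.
    print("u2" in users_by_id, "b@example.com" in users_by_email)   # True False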
Document-oriented
If you are just moving away from an RDBMS and want to keep your data in an object-like form, as close to a table-like structure as possible, a document structure is the way to go! It is particularly useful when you are creating an app and don't want to deal with RDBMS table design early on (in the prototyping stage) and your schema could change drastically over time. However, note:
Secondary indexes may not perform as well.
Transactions are not available.
Popular document-oriented databases are: MongoDB, Couchbase.
Comparing Key-value NoSQL databases
memcached
In-memory cache
No persistence
TTL supported
client-side clustering only (client stores value at multiple nodes). Horizontally scalable through client.
Not good for large-size values/documents
Redis
In-memory cache
Disk supported – backup and rebuild from disk
TTL supported
Super-fast (see benchmarks)
Data structure support in addition to key-value
Clustering support not mature enough yet. Vertically scalable (see Redis Cluster specification)
Horizontal scaling could be tricky.
Supports Secondary indexes
Aerospike
Both in-memory & on-disk
Extremely fast (could support >1 Million TPS on a single node)
Horizontally scalable. Server side clustering. Sharded & replicated data
Automatic failovers
Supports Secondary indexes
CAS (safe read-modify-write) operations, TTL support
Enterprise class
Comparing document-oriented NoSQL databases
MongoDB
Fast
Mature & stable – feature rich
Supports failovers
Horizontally scalable reads – read from replica/secondary
Writes not scalable horizontally unless you use mongo shards
Supports advanced querying
Supports multiple secondary indexes
Shard architecture becomes tricky, and it is not scalable beyond a point where you need secondary indexes. An elementary shard deployment needs 9 nodes at minimum.
Document-level locks are a problem if you have a very high write-rate
Couchbase Server
Fast
Sharded cluster instead of MongoDB's master-slave design
Hot failover support
Horizontally scalable
Supports secondary indexes through views
Learning curve bigger than MongoDB
Claims to be faster

Why do relational databases have scalability issues?

Recently I read some articles online indicating that relational databases have scaling issues and are not good to use when it comes to big data, especially in cloud computing where the data is big. But by googling I could not find good, solid reasons why they aren't very scalable. Can you please explain the limitations of relational databases when it comes to scalability?
Thanks.
Imagine two different kinds of crossroads.
One has traffic lights or police officers regulating traffic, motion on the crossroad is at limited speed, and there's a watchdog registering precisely what car drove on the crossroad at what time precisely, and what direction it went.
The other has none of that and everyone who arrives at the crossroad at whatever speed he's driving, just dives in and wants to get through as quick as possible.
The former is any traditional database engine. The crossroad is the data itself. The cars are the transactions that want to access the data. The traffic lights or police officer is the DBMS. The watchdog keeps the logs and journals.
The latter is a NOACID type of engine.
Both have a saturation point, at which point arriving cars are forced to start queueing up at the entry points. Both have a maximal throughput. That threshold lies at a lower value for the former type of crossroad, and the reason should be obvious.
The advantage of the former type of crossroad should however also be obvious. Way less opportunity for accidents to happen. On the second type of crossroad, you can expect accidents not to happen only if traffic density is at a much much lower point than the theoretical maximal throughput of the crossroad. And in translation to data management engines, it translates to a guarantee of consistent and coherent results, which only the former type of crossroad (the classical database engine, whether relational or networked or hierarchical) can deliver.
The analogy can be stretched further. Imagine what happens if an accident DOES happen. On the second type of crossroad, the primary concern will probably be to clear the road as quick as possible, so traffic can resume, and when that is done, what info is still available to investigate who caused the accident and how ? Nothing at all. It won't be known. The crossroad is open just waiting for the next accident to happen. On the regulated crossroad, there's the police officer regulating the traffic who saw what happened and can testify. There's the logs saying which car entered at what time precisely, at which entry point precisely, at what speed precisely, a lot of material is available for inspection to determine the root cause of the accident. But of course none of that comes for free.
Colourful enough as an explanation?
Relational databases provide solid, mature services according to the ACID properties. We get transaction handling, efficient logging to enable recovery, etc. These are core services of relational databases, and the ones they are good at. They are hard to customize, and might be considered a bottleneck, especially if you don't need them in a given application (e.g. serving website content of low importance; in this case, for example, the widely used MySQL does not provide transaction handling with its default storage engine, and therefore does not satisfy ACID). Lots of "big data" problems don't require these strict constraints, for example web analytics, web search or processing moving-object trajectories, as they already include uncertainty by nature.
When reaching the limits of a given computer (memory, CPU, disk: the data is too big, or data processing is too complex and costly), distributing the service is a good idea. Lots of relational and NoSQL databases offer distributed storage. In this case, however, ACID turns out to be difficult to satisfy: the CAP theorem states something similar, namely that availability, consistency and partition tolerance cannot all be achieved at the same time. If we give up ACID (satisfying BASE, for example), scalability might be increased.
See, e.g., this post for a categorization of storage methods according to CAP.
Another bottleneck might be the flexible, richly typed relational model itself with its SQL operations: in lots of cases a simpler model with simpler operations would be sufficient and more efficient (like untyped key-value stores). The common row-wise physical storage model might also be limiting: for example, it isn't optimal for data compression.
There are however fast and scalable ACID compliant relational databases, including new ones like VoltDB, as the technology of relational databases is mature, well-researched and widespread. We just have to select an appropriate solution for the given problem.
Take the simplest example: inserting a row with a generated ID. Since IDs must be unique within the table, the database must somehow lock some sort of persistent counter so that no other INSERT uses the same value. So you have two choices: either allow only one instance to write data or have a distributed lock. Both solutions are a major bottleneck - and this is the simplest example!
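One common workaround (not the only one, and only a sketch under the assumption that gaps and out-of-order IDs are acceptable) is to avoid the shared counter entirely by giving each node its own interleaved ID range, so no cross-node lock is needed:

    class NodeIdAllocator:
        """Node k of n hands out the IDs k, k + n, k + 2n, ... so that
        nodes never collide and never have to coordinate."""

        def __init__(self, node_index, node_count):
            self.next_id = node_index
            self.step = node_count

        def allocate(self):
            value = self.next_id
            self.next_id += self.step
            return value

    a = NodeIdAllocator(node_index=0, node_count=3)
    b = NodeIdAllocator(node_index=1, node_count=3)
    print([a.allocate() for _ in range(3)])   # [0, 3, 6]
    print([b.allocate() for _ in range(3)])   # [1, 4, 7]

The price is that IDs are no longer dense or globally ordered - which is exactly the kind of relaxation that distributed systems force on you.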

Master-master vs master-slave database architecture?

I've heard about two kind of database architectures.
master-master
master-slave
Isn't master-master more suitable for today's web? Because it's like Git: every unit has the whole set of data, and if one goes down, it doesn't matter much.
Master-slave reminds me of SVN (which I don't like), where you have one central unit that handles things.
Questions:
What are the pros and cons of each?
If you want to have a local database in your mobile phone like iPhone, which one is more appropriate?
Is the choice of one of these a critical factor to consider thoroughly?
While researching the various database architectures as well, I compiled a good bit of information that might be relevant to someone else researching this in the future. I came across:
Master-Slave Replication
Master-Master Replication
MySQL Cluster
I have decided to settle on MySQL Cluster for my use case. However, please see below for the various pros and cons that I have compiled.
1. Master-Slave Replication
Pros
Analytic applications can read from the slave(s) without impacting the master
Backups of the entire database have relatively no impact on the master
Slaves can be taken offline and synced back to the master without any downtime
Cons
In the instance of a failure, a slave has to be promoted to master to take over its place. No automatic failover
Downtime and possibly loss of data when a master fails
All writes also have to be made to the master in a master-slave design (see the routing sketch after this list)
Each additional slave adds some load to the master, since the binary log has to be read and data copied to each slave
Application might have to be restarted
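A minimal sketch (plain Python; the pool class and connection names are hypothetical) of the read/write split this design implies, as referenced in the cons above: every write goes to the single master, reads fan out over the slaves, and a master failure blocks writes until a slave is promoted.

    import itertools

    class MasterSlavePool:
        """Route writes to the master and spread reads over the slaves."""

        def __init__(self, master, slaves):
            self.master = master
            self.slaves = itertools.cycle(slaves)   # naive round-robin for reads

        def execute_write(self, sql):
            if self.master is None:
                raise RuntimeError("master down: no writes until a slave is promoted")
            return f"{sql!r} sent to master {self.master}"

        def execute_read(self, sql):
            return f"{sql!r} sent to slave {next(self.slaves)}"

        def promote(self, slave):
            self.master = slave                     # the manual failover step

    pool = MasterSlavePool("db-master", ["db-slave-1", "db-slave-2"])
    print(pool.execute_write("INSERT INTO orders VALUES (1)"))
    print(pool.execute_read("SELECT count(*) FROM orders"))
    pool.master = None            # simulate master failure
    pool.promote("db-slave-1")    # manual promotion restores writes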
2. Master-Master Replication
Pros
Applications can read from both masters
Distributes write load across both master nodes
Simple, automatic and quick failover
Cons
Loosely consistent
Not as simple as master-slave to configure and deploy
3. MySQL Cluster
The new kid in town, based on the MySQL Cluster design. MySQL Cluster was developed with high availability and scalability in mind and is the ideal solution for environments that require no downtime, high availability and horizontal scalability.
See MySQL Cluster 101 for more information
Pros
(High Availability) No single point of failure
Very high throughput
99.99% uptime
Auto-Sharding
Real-Time Responsiveness
On-Line Operations (Schema changes etc)
Distributed writes
Cons
See known limitations
You can visit my blog for a full breakdown, including architecture diagrams, that goes into further detail about the three architectures mentioned.
We're trading off availability, consistency and complexity. To address the last question first: does this matter? Yes, very much! The choices concerning how your data is to be managed are absolutely fundamental, and there is no "best practice" that dodges the decisions. You need to understand your particular requirements.
There's a fundamental tension:
One copy: consistency is easy, but if it happens to be down everybody is out of luck, and if people are remote they may pay horrid communication costs. Bring portable devices, which may need to operate disconnected, into the picture and one copy won't cut it.
Master-Slave: consistency is not too difficult because each piece of data has exactly one owning master. But then what do you do if you can't see that master? Some kind of postponed work is needed.
Master-Master: well if you can make it work then it seems to offer everything, no single point of failure, everyone can work all the time. The trouble with this is that it is very hard to preserve absolute consistency. See the wikipedia article for more.
Wikipedia seems to have a nice summary of the advantages and disadvantages
Advantages
• If one master fails, other masters will continue to update the database.
• Masters can be located in several physical sites, i.e. distributed across the network.
Disadvantages
• Most multi-master replication systems are only loosely consistent, i.e. lazy and asynchronous, violating ACID properties.
• Eager replication systems are complex and introduce some communication latency.
• Issues such as conflict resolution can become intractable as the number of nodes involved rises and the required latency decreases.

Keeping distributed databases synchronized in an unstable network

I'm facing the following challenge:
I have a bunch of databases in different geographical locations where the network may fail a lot (I'm using cellular network). I need to keep all the databases synchronized but there is no need to be in real time. I'm using Java but I have the freedom to choose any free database.
How can I achieve this?
It's a problem with a quite established corpus of research (of which people are apparently unaware). I suggest not reinventing a poor, defective wheel unless absolutely necessary (for example, when the requirements are so unusual that they allow a trivial solution).
Some keywords: replication, mobile DBMSs, distributed disconnected DBMSs.
Also these research papers are relevant (as an example of this research field):
Distributed disconnected databases,
The dangers of replication and a solution,
Improving Data Consistency in Mobile Computing Using Isolation-Only Transactions,
Dealing with Server Corruption in Weakly Consistent, Replicated Data Systems,
Rumor: Mobile Data Access Through Optimistic Peer-to-Peer Replication,
The Case for Non-transparent Replication: Examples from Bayou,
Bayou: replicated database services for world-wide applications,
Managing update conflicts in Bayou, a weakly connected replicated storage system,
Two-level client caching and disconnected operation of notebook computers in distributed systems,
Replicated document management in a group communication system,
... and so on.
I am not aware of any databases that will give you this functionality out of the box; there is a lot of complexity here due to the need for eventual consistency and conflict resolution (e.g., what happens if the network gets split into two halves, and you update something to the value 123 while I update it on the other half to 321, and then the networks reconnect?)
You may have to roll your own.
For some ideas on how to do this, check out the design of Yahoo's PNUTS system: http://research.yahoo.com/node/2304 and Amazon's Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
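Picking up the 123-versus-321 example above, a common roll-your-own starting point is a last-writer-wins merge keyed on a version stamp. The sketch below (plain Python, made-up record format) shows both halves converging to the same value after the partition heals; note that one of the writes is silently discarded, which is precisely why systems like Dynamo reach for vector clocks and application-level conflict resolution instead.

    def merge(local, remote):
        """Last-writer-wins: per key, keep the value with the higher (version, node) stamp."""
        merged = dict(local)
        for key, (value, stamp) in remote.items():
            if key not in merged or stamp > merged[key][1]:
                merged[key] = (value, stamp)
        return merged

    # Each replica stores key -> (value, (logical_counter, node_id)).
    half_a = {"setting": (123, (7, "node-a"))}
    half_b = {"setting": (321, (9, "node-b"))}

    # When the network reconnects, both sides converge to the same answer...
    print(merge(half_a, half_b))   # {'setting': (321, (9, 'node-b'))}
    print(merge(half_b, half_a))   # same result; the 123 write is lost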
Check out SymmetricDS. SymmetricDS is web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time. The software was designed to scale for a large number of databases, work across low-bandwidth connections, and withstand periods of network outage.
I don't know your requirements or your apps, but this isn't a quick-answer type of question. I'm very interested to see what others have to say. However, I have a suggestion that may or may not work for you, depending on your requirements and situation. In particular, this will not help if your users need to use the app even when the network is unavailable (offline access).
Keeping a bunch of small databases synchronized is a fairly complex task to do correctly. Is there any possibility of just having one centralized database, and either having the client applications connect directly to it or (my preferred solution) writing some web services to handle accessing/updating the data, rather than having a bunch of client databases?
I realize this limits offline access, but there are various caching strategies you can use. (Which of course, leads you back to your original question.)
