Can you recommend a database that scales horizontally? [closed] - database

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
Generally the database server is the biggest, most expensive box we have to buy as scaling vertically is the only option. Are there any databases that scale well horizontally (i.e. across multiple commodity machines) and what are the limitations in this approach?

Oracle RAC is not horizontally scalable at all, because all Oracle instances share the same data storage. Yes, with SAN stuff u can get a large size DB, but it's just not scalable at all. In other words, Oracle RAC is still a scale-up approach. So for scaling-out or horizontally scaling, you have to partition your data by function that means put different groups of tables in different databases; or partition your data per table that means partition one table into multiple subtables with the same schema but store in different databases. In this way, you get a scaling-out solution. There are many resources on that. Sharding has been a buzz word for a while in web 2.0 website architecture blog sphere.
Because Sharding is not directly supported by database itself, you have to build your own solution. But as I said, there are many lessons already. For oracle, partitioning table is possible. For mysql, check this question

Oracle RAC -- Real Application Cluster
This works nicely, you just add boxes to your cluster. You can fail over from one box to the other. It's not replication, all the boxes are part of the same logical unit.
It's pretty spendy, of course.

Don't worry, good solutions are coming!
Couchdb and Hypertable are open source and still in alpha, but they are clearly designed to make scaling on commodity software simple. They work pretty well, and may change how you think about databases.
Also, if it's okay to let someone else do the distributing for you, Google AppEngine and Amazon SimpleDB are extremely cheap distributed database services, though they're both in beta right now so strict limitations are imposed.

There are storage techniques such as JavaSpaces (or a commercial implementation such as Gigaspaces) which provide highly scalable, fast & secure access to objects.
There are also distributed cacheing systems such as memcached, which offer a similar approach.
Of course, neither of these are true databases, but they are things that can work in conjunction with databases to offer a large amount of horizontal scalability, given a suitable architecture. The real problem is that if you want all of the ACID goodness that comes with a database, there are certain unavoidable performance penalties. The only way out is to figure out the bits where you don't need ACID, and use other technologies to service those bits.

Oracle RAC is the Rolls Royce of databases allowing extra hardware nodes to be added relatively easily and hardware failover.
However, your commodity hardware costs will be dwarfed by the licence costs.
Why dod you feel you need horizontal scaling. A multi CPU core server with 40GB RAM and SAN storage can support very sizeable DB installation.
Can you provide any sizing and expected activity information to allow better understanding of your problem?

If you do go down the RAC route it's worth remembering that it doesnt scale horizontally forever. Even the sales guys admit 90% of rac customers are 4 nodes or less. Once you go more than that you get diminishing returns. So rac may work for you, but it's not guaranteed to be the answer.

MySQL: http://www.mysql.com/why-mysql/scaleout.html
Limitations are that it works best with read-mostly workloads. You typically have one 'master' that receives all the writes, and many 'slaves' that replicate the writes. Then you distribute the reads over all the databases.
MySQL replication is asynchronous, so you will probably have to deal with time lag problems (you write to the master, and then read from a slave before the write has been replicated).

Netezza and other datawarehouse appliances scale this way, but they are not good for OLTP and web app workloads.

The Oracle route for scaling across multiple machines is called Real Application Clusters (Oracle RAC). There's no end of documentation on this elsewhere; you might try starting at http://www.oracle.com/database/rac_home.html.

MongoDB
is one of the best database that scales horizontally.

Oracle Real Application Clusters. If you want the best then take a look at it.

If you seriously think you will out scale a decent multicore "Big Iron" box, then you think about partitioning your data. This is a good, database agnostic way to scale out.
All databases which horizontally will come at a serious cost.
Unless you have mega $$'s to throw at the problem, forget about RAC. While its very good, its VERY expensive once you scale beyond 2 nodes.

You might look at DashDB for OLAP -- IBM pairs it with Cloudant for OLTP.
https://www.ibm.com/developerworks/community/blogs/5things/entry/5_things_to_know_about_dashdb_placeholder?lang=en

Related

Difference between scaling horizontally and vertically for databases [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I have come across many NoSQL databases and SQL databases. There are varying parameters to measure the strength and weaknesses of these databases and scalability is one of them. What is the difference between horizontally and vertically scaling these databases?
Horizontal scaling means that you scale by adding more machines into your pool of resources whereas Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.
An easy way to remember this is to think of a machine on a server rack, we add more machines across the horizontal direction and add more resources to a machine in the vertical direction.
                 
In the database world, horizontal-scaling is often based on the partitioning of the data i.e. each node contains only part of the data, in vertical-scaling the data resides on a single node and scaling is done through multi-core i.e. spreading the load between the CPU and RAM resources of that machine.
With horizontal-scaling it is often easier to scale dynamically by adding more machines into the existing pool - Vertical-scaling is often limited to the capacity of a single machine, scaling beyond that capacity often involves downtime and comes with an upper limit.
Good examples of horizontal scaling are Cassandra, MongoDB, Google Cloud Spanner .. and a good example of vertical scaling is MySQL - Amazon RDS (The cloud version of MySQL). It provides an easy way to scale vertically by switching from small to bigger machines. This process often involves downtime.
In-Memory Data Grids such as GigaSpaces XAP, Coherence etc.. are often optimized for both horizontal and vertical scaling simply because they're not bound to disk. Horizontal-scaling through partitioning and vertical-scaling through multi-core support.
You can read more on this subject in my earlier posts:
Scale-out vs Scale-up and The Common Principles Behind the NOSQL Alternatives
Scaling horizontally ===> Thousands of minions will do the work together for you.
Scaling vertically ===> One big hulk will do all the work for you.
Let's start with the need for scaling that is increasing resources so that your system can now handle more requests than it earlier could.
When you realise your system is getting slow and is unable to handle the current number of requests, you need to scale the system.
This provides you with two options. Either you increase the resources in the server which you are using currently, i.e, increase the amount of RAM, CPU, GPU and other resources. This is known as vertical scaling.
Vertical scaling is typically costly.
It does not make the system fault tolerant, i.e if you are scaling application running with single server, if that server goes down, your system will go down.
Also the amount of threads remains the same in vertical scaling.
Vertical scaling may require your system to go down for a moment when process takes place. Increasing resources on a server requires a restart and put your system down.
Another solution to this problem is increasing the amount of servers present in the system. This solution is highly used in the tech industry.
This will eventually decrease the request per second rate in each server.
If you need to scale the system, just add another server, and you are done. You would not be required to restart the system.
Number of threads in each system decreases leading to high throughput.
To segregate the requests, equally to each of the application server, you need to add load balancer which would act as reverse proxy to the web servers. This whole system can be called as a single cluster.
Your system may contain a large number of requests which would require more amount of clusters like this.
Hope you get the whole concept of introducing scaling to the system.
There is an additional architecture that wasn't mentioned - SQL-based database services that enable horizontal scaling without the complexity of manual sharding. These services do the sharding in the background, so they enable you to run a traditional SQL database and scale out like you would with NoSQL engines like MongoDB or CouchDB. Two services I am familiar with are EnterpriseDB for PostgreSQL and Xeround for MySQL. I saw an in-depth post by Xeround which explains why scale-out on SQL databases is difficult and how they do it differently - treat this with a grain of salt as it is a vendor post. Also check out Wikipedia's Cloud Database entry, there is a nice explanation of SQL vs. NoSQL and service vs. self-hosted, a list of vendors and scaling options for each combination. ;)
Yes scaling horizontally means adding more machines, but it also implies that the machines are equal in the cluster. MySQL can scale horizontally in terms of Reading data, through the use of replicas, but once it reaches capacity of the server mem/disk, you have to begin sharding data across servers. This becomes increasingly more complex. Often keeping data consistent across replicas is a problem as replication rates are often too slow to keep up with data change rates.
Couchbase is also a fantastic NoSQL Horizontal Scaling database, used in many commercial high availability applications and games and arguably the highest performer in the category. It partitions data automatically across cluster, adding nodes is simple, and you can use commodity hardware, cheaper vm instances (using Large instead of High Mem, High Disk machines at AWS for instance). It is built off the Membase (Memcached) but adds persistence. Also, in the case of Couchbase, every node can do reads and writes, and are equals in the cluster, with only failover replication (not full dataset replication across all servers like in mySQL).
Performance-wise, you can see an excellent Cisco benchmark: http://blog.couchbase.com/understanding-performance-benchmark-published-cisco-and-solarflare-using-couchbase-server
Here is a great blog post about Couchbase Architecture: http://horicky.blogspot.com/2012/07/couchbase-architecture.html
Traditional relational databases were designed as client/server database systems. They can be scaled horizontally but the process to do so tends to be complex and error prone. NewSQL databases like NuoDB are memory-centric distributed database systems designed to scale out horizontally while maintaining the SQL/ACID properties of traditional RDBMS.
For more information on NuoDB, read their technical white paper.
SQL databases like Oracle, db2 also support Horizontal scaling through Shared disk cluster. For example Oracle RAC, IBM DB2 purescale or Sybase ASE Cluster edition. New node can be added to Oracle RAC system or DB2 purescale system to achieve horizontal scaling.
But the approach is different from noSQL databases (like mongodb, CouchDB or IBM Cloudant) is that the data sharding is not part of Horizontal scaling. In noSQL databases data is shraded during horizontal scaling.
The accepted answer is spot on the basic definition of horizontal vs vertical scaling. But unlike the common belief that horizontal scaling of databases is only possible with Cassandra, MongoDB, etc I would like to add that horizontal scaling is also very much possible with any traditional RDMS; that too without using any third party solutions.
I know of many companies, specially SaaS based companies that do this. This is done using simple application logic. You basically take a set of users and divide them over multiple DB servers. So for example, you would typically have a "meta" database/table that would store clients, DB server/connection strings, etc and a table that stores client/server mapping.
Then simply direct requests from each client to the DB server they are mapped to.
Now some may say this is akin to horizontal partitioning and not "true" horizontal scaling and they will be right in some ways. But the end result is that you have scaled your DB over multiple Db servers.
The only difference between the two approaches to horizontal scaling is that one approach (MongoDB, etc) the scaling is done by the DB software itself. In that sense you are "buying" the scaling. In the other approach (for RDBMS horizontal scaling), the scaling is built by application code/logic.
Adding lots of load balancers creates extra overhead and latency and that is the drawback for scaling out horizontally in nosql databases. It is like the question why people say RPC is not recommended since it is not robust.
I think in a real system we should use both sql and nosql databases to utilize both multicore and cloud computing capabilities of today's systems.
On the other hand, complex transactional queries has high performance if sql databases such as oracle being used. NoSql could be used for bigdata and horizontal scalability by sharding.
You have a company and there is only 1 worker but you got 1 new project at that time you hire new candidate -- this is horizontal scaling. where new candidate is new machines and project is new traffic/calls to your api's.
Where as 1 project with an IIT/NIT guy handling all request to your api/traffic. If any time more request to your api's then fire him and replacing him with a high IQ NIT/IIT guy -- this is vertical scaling.

why everyone wants NOSQL other than large-scale Oracle Clusters?

oracle has a good reputation for handling large-scale applications and it's also flexible to extend to cluster environment. Why everyone wants NOSQL?
because nosql db is much cheaper?
why not swith object-oriented db?
Firstly, not everyone does want NoSQL. Packaged software (eg ERP) is all pretty much mainstream RDBMS stuff. Don't confuse the amount of development effort with the usage.
What has happened is that NoSQL has opened up a whole range of applications that simply didn't suit relational technologies, and so there's been a rush of applications that CAN be developed. Most of the stuff that could be developed on RDBMS platforms already has been and is in more of a maintenance/upgrade phase. Probably with less upgrading than usual because of the whole global financial climate.
So in ten/fifteen years, as those NoSQL apps move into the same level of maturity, the frenzy will have died down and there's be less excitement.
Oracle also have NoSQL solution: http://www.oracle.com/technetwork/database/nosqldb/overview/index.html
A SQL solution isn't necessarily what you want regardless of scale.
There are situations where you can't easily predict the schema of your model, or worst still your data is schemaless. In those situations you want a data model which doesn't limit you but rather allows you the flexibilty you need to evolve your data while still maintaining core abilities such as fast indexes.
Another reason is that SQL doesn't represent the natural way in which you want to look at your data, predominantly graph DBs such as Neo4J or GraphDB allow developers or users to approach a linked graph model in a more intuitive way.
Of course there is a way to address all these issues in Oracle RDBMS, but it feels more like hacking the DB to fit your needs as opposed to using a DB that fits you. This sounds like a perk but it actually goes a long way into the ease of development and analysis of your application.
Now if we are talking about scale, Oracle can probably beat column based DBs such as HBase or Hypertable, but it is important to note that Oracle RDBMS isn't just more expensive it's way way more expensive. In today's world even small time startups have Terabytes of data they need to analyze daily. Even small companies can use clusters of 100 machines in the cloud to store their data, in such a company Oracle isn't a viable option the annual licensing cost and the hiring of DBAs will prevent startups from using it.
Finally the last reason why you would start off with NoSQL is speed, bringing up a MongoDB and starting development can be done in 5 minutes and sometimes you wanna handle problems as they come up and avoid premature optimization
If you are willing to let go consistency, there are no theorical limits to how mouch you can scale out some NoSQL solutions.
Some RDBMS can scale a lot, and Oracle is among the best of them, but no RDBMS let you cut in consistency, and therefor even the best has a pretty clear theorical limit to how much it can scale, not to mention the real world limits.
Some big names on the net just can't rely only on RDBMS anymore, and many others follows just to be like the big ones. Finally, some solutions are realy best fitted on a scheam-less structure, but those don't account for the most part of NoSQL users I would guess. The main point is to scale the web 2.0.
This is a very general question you are asking. Are you comparing the relational database to NOSQL database, or are you comparing commercial or open source database? We need to figure out as it seems we are comparing apples to oranges, and you will not get a straight answer.
Here are the breakdown from my perspective.
DB type: If you are comparing relational db vs NOSQL db, you should refer to this link
instead.
Cost: If you are comparing from a cost perspective, each has its own cost. Oracle will charge for the license, while the NOSQL db (using MongoDB as an example) are open source, and you don't have to pay the license. But you will need someone to have a good understanding of NOSQL to administer and maintain the site, which is a lot harder to find.
Application: What kind of application are you writing? You need to understand the database requirement for your application. If you need to store a lot of unstructured data as a key value entry, you are better off using NOSQL. On the other side of coin, you would prefer SQL db if you will join a lot of tables and execute complex SQL.
You mentioned Object Oriented DB, and that is another type of database from NoSQL or relational DB. At the end, it depends on your need. To expand the choice of horizon, some might heard of hierarchical database, which mainly live in mainframe environment. :-)

How do the newer database models achieve better scalability and performance as compared to a traditional RDBMS implementation?

We have
BigTable from Google,
Hadoop, actively contributed by Yahoo,
Dynamo from Amazon
all aiming towards one common goal - making data management as scalable as possible.
By scalability what I understand is that the cost of the usage should not go up drastically when the size of data increases.
RDBMS's are slow when the amount of data is large as the number of indirections invariable increases leading to more IO's.
How do these custom scalable friendly data management systems solve the problem?
This is a figure from this document explaining Google BigTable:
Looks the same to me. How is the ultra-scalability achieved?
The "traditional" SQL DBMS market really means a very small number of products, which have traditionally targeted business applications in a corporate setting. Massive shared-nothing scalability has not historically been a priority for those products or their customers. So it is natural that alternative products have emerged to support internet scale database applications.
This has nothing to do with the fact that these new products are not "Relational" DBMSs. The relational model can scale just as well as any other model. Arguably the relational model suits these types of massively scalable applications better than say, network (graph based) models. It's just that the SQL language has a lot of disadvantages and no-one has yet come up with suitable relational NOSQL (non-SQL) alternatives.
Speaking specifically to your question about Bigtable, the difference is that the heirarchy in the diagram above is all there is. Each Bigtable tabletserver is responsible for a set of tablets (contiguous row ranges from a table); the mapping from row range to tablet is maintained in the metadata table, while the mapping from tablet to tabletserver is maintained in the memory of the Bigtable master. Looking up a row, or range of rows, requires looking up the metadata entry (which will almost certainly be in memory on the server that hosts it), then using that to look up the actual row on the server responsible for it - resulting in only one, or a few disk seeks.
In a nutshell, the reason this scales well is because it's possible to throw more hardware at it: given enough resources, the metadata is always in memory, and thus there's no need to go to disk for it, only for the data (and not always for that, either!).
It's about using cheap comodity hardware to build a network/grid/cloud and spread the data and load (for example using map/reduce).
RDBMS databases seem to me like software being (originaly) designed to run on one supercomputer. You can use various hard drive arrays, DB clusters, but still..
The amount of data increased so there's one more reason to design new data storages with this in mind - scalability, high availability, terabytes of data.
Another thing - if you build a grid/cloud from cheap servers, it's fault tolerant because you store all data at three (?) different locations and at the same time it's cheap.
Back to your pictures - the first one is from one computer (typically), the second one from a network of computers.
One theoretical answer on scalability is at http://queue.acm.org/detail.cfm?id=1394128 - the ACID guarantees are expensive. See http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf for a counter-argument.
In fact just surviving power failures is expensive. Years ago now I compared MySQL against Oracle. MySQL was almost unbelieveably faster than Oracle, but we couldn't use it. MySQL of those days was built on top of Berkeley
DB, which was miles faster than Oracle's full blown log-based database, but if the power went off while Berkely DB based MySQL was running, it was a manual process to get the database consistent again when the power went back on, and you'ld probably lose recent updates for good.

When NOT to use Cassandra? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
There has been a lot of talk related to Cassandra lately.
Twitter, Digg, Facebook, etc all use it.
When does it make sense to:
use Cassandra,
not use Cassandra, and
use a RDMS instead of Cassandra.
There is nothing like a silver bullet, everything is built to solve specific problems and has its own pros and cons. It is up to you, what problem statement you have and what is the best fitting solution for that problem.
I will try to answer your questions one by one in the same order you asked them. Since Cassandra is based on the NoSQL family of databases, it's important you understand why use a NoSQL database before I answer your questions.
Why use NoSQL
In the case of RDBMS, making a choice is quite easy because all the databases like MySQL, Oracle, MS SQL, PostgreSQL in this category offer almost the same kind of solutions oriented toward ACID properties. When it comes to NoSQL, the decision becomes difficult because every NoSQL database offers different solutions and you have to understand which one is best suited for your app/system requirements. For example, MongoDB is fit for use cases where your system demands a schema-less document store. HBase might be fit for search engines, analyzing log data, or any place where scanning huge, two-dimensional join-less tables is a requirement. Redis is built to provide In-Memory search for varieties of data structures like trees, queues, linked lists, etc and can be a good fit for making real-time leaderboards, pub-sub kind of system. Similarly there are other databases in this category (Including Cassandra) which are fit for different problem statements. Now lets move to the original questions, and answer them one by one.
When to use Cassandra
Being a part of the NoSQL family, Cassandra offers a solution for problems where one of your requirements is to have a very heavy write system and you want to have a quite responsive reporting system on top of that stored data. Consider the use case of Web analytics where log data is stored for each request and you want to built an analytical platform around it to count hits per hour, by browser, by IP, etc in a real time manner. You can refer to this blog post to understand more about the use cases where Cassandra fits in.
When to Use a RDMS instead of Cassandra
Cassandra is based on a NoSQL database and does not provide ACID and relational data properties. If you have a strong requirement for ACID properties (for example Financial data), Cassandra would not be a fit in that case. Obviously, you can make a workaround for that, however you will end up writing lots of application code to simulate ACID properties and will lose on time to market badly. Also managing that kind of system with Cassandra would be complex and tedious for you.
When not to use Cassandra
I don't think it needs to be answered if the above explanation makes sense.
When evaluating distributed data systems, you have to consider the CAP theorem - you can pick two of the following: consistency, availability, and partition tolerance.
Cassandra is an available, partition-tolerant system that supports eventual consistency. For more information see this blog post I wrote: Visual Guide to NoSQL Systems.
Cassandra is the answer to a particular problem: What do you do when you have so much data that it does not fit on one server ? How do you store all your data on many servers and do not break your bank account and not make your developers insane ? Facebook gets 4 Terabyte of new compressed data EVERY DAY. And this number most likely will grow more than twice within a year.
If you do not have this much data or if you have millions to pay for Enterprise Oracle/DB2 cluster installation and specialists required to set it up and maintain it, then you are fine with SQL database.
However Facebook no longer uses cassandra and now uses MySQL almost exclusively moving the partitioning up in the application stack for faster performance and better control.
The general idea of NoSQL is that you should use whichever data store is the best fit for your application. If you have a table of financial data, use SQL. If you have objects that would require complex/slow queries to map to a relational schema, use an object or key/value store.
Of course just about any real world problem you run into is somewhere in between those two extremes and neither solution will be perfect. You need to consider the capabilities of each store and the consequences of using one over the other, which will be very much specific to the problem you are trying to solve.
Besides the answers given above about when to use and when not to use Cassandra, if you do decide to use Cassandra you may want to consider not using Cassandra itself, but one of the its many cousins out there.
Some answers above already pointed to various "NoSQL" systems which share many properties with Cassandra, with some small or large differences, and may be better than Cassandra itself for your specific needs.
Additionally, recently (several years after this question was originally asked), a Cassandra clone called Scylla (see https://en.wikipedia.org/wiki/Scylla_(database)) was released. Scylla is an open-source re-implementation of Cassandra in C++, which claims to have significantly higher throughput and lower latencies than the original Java Cassandra, while being mostly compatible with it (in features, APIs, and file formats). So if you're already considering Cassandra, you may want to consider Scylla as well.
I will focus here on some of the important aspects which can help you to decide if you really need Cassandra. The list is not exhaustive, just some of the points which I have at top of my mind-
Don't consider Cassandra as the first choice when you have a strict requirement on the relationship (across your dataset).
Cassandra by default is AP system (of CAP). But, it supports tunable consistency which means it can be configured to support as CP as well. So don't ignore it just because you read somewhere that it's AP and you are looking for CP systems. Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.
Don't use Cassandra if your scale is not much or if you can deal with a non-distributed DB.
Think harder if your team thinks that all your problems will be solved if you use distributed DBs like Cassandra. To start with these DBs is very simple as it comes with many defaults but optimizing and mastering it for solving a specific problem would require a good (if not a lot) amount of engineering effort.
Cassandra is column-oriented but at the same time each row also has a unique key. So, it might be helpful to think of it as an indexed, row-oriented store. You can even use it as a document store.
Cassandra doesn't force you to define the fields beforehand. So, if you are in a startup mode or your features are evolving (as in agile) - Cassandra embraces it. So better, first think about queries and then think about data to answer them.
Cassandra is optimized for really high throughput on writes. If your use case is read-heavy (like cache) then Cassandra might not be an ideal choice.
Right. It makes sense to use Cassandra when you have a huge amount of data, a huge number of queries but very little variety of queries. Cassandra basically works by partitioning and replicating. If all your queries will be based on the same partition key, Cassandra is your best bet. If you get a query on an attribute that is not the partition key, Cassandra allows you to replicate the whole data with a new partition key. So now you have 2 replicas of the same data with 2 different partition keys.
Which brings me to your next question. When not to use Cassandra. As I mentioned, Cassandra scales by replicating the complete database for every new partitioning key. But you can't keep making new copies again and again. So when you have a high variety in queries i.e. each query has a different column in the where clause, Cassandra is not a good option.
Now for the third question. The whole point of using RDBMS is when you want the ACID properties. If you are building something like a payment service and want each transaction to be isolated, each transaction to either complete or not happen at all, changes to be persistent despite system failure, and the money to be consistent across bank accounts before and after the transaction completes, an RDBMS is the only option that will help you achieve this.
This article actually explains the whole thing, especially when to use Cassandra or not (as opposed to some other NoSQL option) part of the question -> Choosing the best Database. Do check it out.
EDIT: To answer the question in the comments by proximab, when we think of banking systems we immidiately think "ACID is the best solution". But even banking systems are made up of several subsystems that might not even be dealing with any transaction related data like account holder's personal information, account statements, credit card details, credit histories, etc.
All of this information needs to be stored in some database or the another. Now if you store the account related information like account balance, that is something that needs to be consistent at all times. For example, if you try to send money from account A to account B, then the money that disappears from account A should instantaneousy show up in account B, and it cannot be present in both accounts at the same time. This system cannot be inconsistant at any point. This is where ACID is of utmost importance.
On the other hand if you are saving credit card details or credit histories, that should not get into the wrong hands, then you need something that allows access only to authorised users. That I believe is supported by Cassandra. That said, data like credit history and credit card transactions, I think that is an ever increasing data. Also there is only so much yo can query on this data i.e. it has a very finite number of queries. These two conditions make Cassandra a perfect solution.
Talking with someone in the midst of deploying Cassandra, it doesn't handle the many-to-many well. They are doing a hack job to do their initial testing. I spoke with a Cassandra consultant about this and he said he wouldn't recommend it if you had this problem set.
You should ask your self the following questions:
(Volume, Velocity) Will you be writing and reading TONS of information , so much information that no one computer could handle the writes.
(Global) Will you need this writing and reading capability around the world so that the writes in one part of the world are accessible in another part of the world?
(Reliability) Do you need this database to be up and running all the time and never go down regardless of which Cloud, which country, whether it's VM , Container, or Bare metal?
(Scale-ability) Do you need this database to be able to continue to grow easily and scale linearly
(Consistency) Do you need TUNABLE consistency where some writes can happen asynchronously where as others need to be certified?
(Skill) Are you willing to do what it takes to learn this technology and the data modeling that goes with creating a globally distributed database that can be fast for everyone, everywhere?
If for any of these questions you thought "maybe" or "no," you should use something else. If you had "hell yes" as an answer to all of them, then you should use Cassandra.
Use RDBMS when you can do everything on one box. It's probably easier than most and anyone can work with it.
Heavy single query vs. gazillion light query load is another point to consider, in addition to other answers here. It's inherently harder to automatically optimize a single query in a NoSql-style DB. I've used MongoDB and ran into performance issues when trying to calculate a complex query. I haven't used Cassandra but I expect it to have the same issue.
On the other hand, if your load is expected to be that of very many small queries, and you want to be able to easily scale out, you could take advantage of eventual consistency that is offered by most NoSql DBs. Note that eventual consistency is not really a feature of a non-relational data model, but it is much easier to implement and to set up in a NoSql-based system.
For a single, very heavy query, any modern RDBMS engine can do a decent job parallelizing parts of the query and take advantage of as much CPU and memory you throw at it (on a single machine). NoSql databases don't have enough information about the structure of the data to be able to make assumptions that will allow truly intelligent parallelization of a big query. They do allow you to easily scale out more servers (or cores) but once the query hits a complexity level you are basically forced to split it apart manually to parts that the NoSql engine knows how to deal with intelligently.
In my experience with MongoDB, in the end because of the complexity of the query there wasn't much Mongo could do to optimize it and run parts of it on multiple data. Mongo parallelizes multiple queries but isn't so good at optimizing a single one.
Let's read some real world cases:
http://planetcassandra.org/apache-cassandra-use-cases/
In this article: http://planetcassandra.org/blog/post/agentis-energy-stores-over-15-billion-records-of-time-series-usage-data-in-apache-cassandra
They elaborated the reason why they didn't choose MySql is because db synchronization is too slow.
(Also due to 2-phrase commit, FK, PK)
Cassandra is based on Amazon Dynamo paper
Features:
Stability
High availability
Backup performs well
Read and Write is better than HBase, (BigTable clone in java).
wiki http://en.wikipedia.org/wiki/Apache_Cassandra
Their Conclusion is:
We looked at HBase, Dynamo, Mongo and Cassandra.
Cassandra was simply the best storage solution for the majority of our data.
As of 2018,
I would recommend using ScyllaDB to replace classic cassandra, if you need back support.
Postgres kv plugin is also quick than cassandra. How ever won't have multi-instance scalability.
another situation that makes the choice easier is when you want to use aggregate function like sum, min, max, etcetera and complex queries (like in the financial system mentioned above) then a relational database is probably more convenient then a nosql database since both are not possible on a nosql databse unless you use really a lot of Inverted indexes. When you do use nosql you would have to do the aggregate functions in code or store them seperatly in its own columnfamily but this makes it all quite complex and reduces the performance that you gained by using nosql.
Cassandra is a good choice if:
You don't require the ACID properties from your DB.
There would be massive and huge number of writes on the DB.
There is a requirement to integrate with Big Data, Hadoop, Hive and Spark.
There is a need of real time data analytics and report generations.
There is a requirement of impressive fault tolerant mechanism.
There is a requirement of homogenous system.
There is a requirement of lots of customisation for tuning.
If you need a fully consistent database with SQL semantics, Cassandra is NOT the solution for you. Cassandra supports key-value lookups. It does not support SQL queries. Data in Cassandra is "eventually consistent". Concurrent lookups of data may be inconsistent, but eventually lookups are consistent.
If you need strict semantics and need support for SQL queries, choose another solution such as MySQL, PostGres, or combine use of Cassandra with Solr.
Apache cassandra is a distributed database for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure.
The archichecture is purely based on the cap theorem, which is availability , and partition tolerance, and interestingly eventual consistently.
Dont Use it, if your not storing volumes of data across racks of clusters,
Dont use if you are not storing Time series data,
Dont Use if you not patitioning your servers,
Dont use if you require strong Consistency.
Mongodb has very powerful aggregate functions and an expressive aggregate framework. It has many of the features developers are accustomed to using from the relational database world. It's document data/storage structure allows for more complex data models than Cassandra, for example.
All this comes with trade-offs of course. So when you select your database (NoSQL, NewSQL, or RDBMS) look at what problem you are trying to solve and at your scalability needs. No one database does it all.
According to DataStax, Cassandra is not the best use case when there is a need for
1- High end hardware devices.
2- ACID compliant with no roll back (bank transaction)
It does not support complete transaction management across the
tables.
Secondary Index not supported.
Have to rely on Elastic search /Solr for Secondary index and the custom sync component has to be written.
Not ACID compliant system.
Query support is limited.

What's the "best" database for embedded? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
I'm an embedded guy, not a database guy. I've been asked to redesign an existing system which has bottlenecks in several places.
The embedded device is based around an ARM 9 processor running at 220mHz.
There should be a database of 50k entries (may increase to 250k) each with 1k of data (max 8 filed). That's approximate - I can try to get more precise figures if necessary.
They are currently using SqlLite 2 and planning to move to SqlLite 3.
Without starting a flame war - I am a complete d/b newbie just seeking advice - is that the "best" decision? I realize that this might be a "how long is a piece of string?" question, but any pointers woudl be greatly welcomed. I don't mind doing a lot of reading & research, but just hoped that you could get me off to a flying start. Thanks.
p.s Again, a total rewrite, might not even stick with embedded Linux, but switch to eCos, don't worry too much about one time conversion between d/b formats. Oh, and accesses should be infrequent, at most one every few seconds.
edit: ok, it seems they have 30k entries (may reach 100k or more) of only 5 or 6 fields each, but at least 3 of them can be a search key for a record. They are toying with "having no d/b at all, since the data are so simple", but it seems to me that with multiple keys, we couldn't use fancy stuff like a quicksort() type search (recursive, binary search). Any thoughts on "no d/b", just data-structures?
Btw, one key is 800k - not sure how well SqlLite handles that (maybe with "no d/b" I have to hash that 800k to something smaller?)
Also SQLite is the Database chosen by virtually all mobile operating systems. Android, Iphone OS and Symbian ship with SQLite which makes me think that manpower was spent to optimize it for the processor in those phones (nearly always ARM).
I would stick with SQLite, it's widely supported and pretty rich in features.
Firebird (previously Interbase) claims to work well embedded.
HypersonicQL (HQL) is small and fast and also claims to be suitable for embedded use.
Alas, I have no personal experience to back up either claim.
SQLite is probably a pretty safe bet. However, if performance is really important for your application and you do not need a relational database, I would suggest you take a look at Berkeley DB link text . Berkeley DB is not a relational database though. In other words, if your data is grouped in different tables and you constantly need to query result sets that require relating data from more than one table, you probably need a relational database. Berkeley DB is better suited for something like look up tables (i.e., the data is organized in a few tables and you don't need to query data from more than one of them in order to produce the result sets you want). Berkeley DB is very fast but it will require more work on your end in order to get the most out of it.
if you want an alternative, then berkeleydb is worth looking at. it used to be owned by sleepycat software, but is now available from oracle. it's a barebones database engine; is directly programmable (rather than a sql) frontend. it's used as part of the core engine in many major databases, and as the database in many embedded devices - it used to be particularly popular for managing routing tables in routers.
it tends to get overlooked these days for more fashionable setups, but i've found it to be decent, solid and for the numbers you are talking about it can be lightning fast.
I will suggest sqlite3 too.
It is used by many famous application.
SQLite is ok, but don't plan to use if you plan to insert, update and delete data that involves more that 6 millon rows(All at the same time, or any partial part). The thing is that the VACCUM keyword has to be done everynow and then and it becomes a very severe bottleneck for performance, even when it's automatic.
8 Years late, but as an update: I've had pretty good experience using Raima Database Manager. If you are looking for a small footprint db, they can get down to 40k. One of the reasons I like RDM is the platform independence, it is portable across 32-bit and 64-bit machines and between big-endian and little-endian architectures as well as support for most operating systems, meaning you can use it on Embedded Linux and eCos as mentioned in the first post. And it's performance gets better as you add better hardware and users as opposed to SQLite
i am not familiar with embed system, but iphone use arm9, and sqlite as DB
The 01-11-10 Embedded.com Newsletter does a nice job of covering this topic. The newsletter can be found at Embedded.com: Embedded.com Tech Focus Newsletter (1-11-10): Embedding Databases.

Resources