NoSQL technologies, use cases, strengths and weaknesses [closed] - database

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I am really interested in non-relational databases, but for various reasons I am familiar with only a small part of them. So I would like to list all the NoSQL technologies you use, with basic use cases, pros and cons.
If you have run into specific issues while working with a technology, or have interesting experience with it, you are welcome to share it with the community.
Personally I worked with:
MongoDB:
Use cases: In my opinion it is one of the best choices if you need good aggregation features and automatic replication. It scales well. It has many features that make it usable as an everyday database, so if for some reason you don't want a SQL solution, Mongo can be a great choice. Mongo is also great if you need dynamic queries, and it supports indexing, which is an important feature too.
Pros: Fast, scales well, easy to use, built-in geospatial indexes.
Cons: Comparatively slow write operations; blocking atomic operations can cause a lot of problems; the memory-hungry process can "eat" all available memory.
CouchDB:
Use cases: I used it in a wiki-like project and I think it is the perfect database for that case. The fact that each document is automatically saved as a new revision on update helps to see all the changes. It suits accumulating, occasionally changing data on which pre-defined queries are to be run.
Pros: Easy to use, REST-oriented interface, versioning.
Cons: Performance problems when the number of documents is quite large (more than half a million); somewhat poor query features (this can be solved by adding Lucene).
SimpleDB:
Use cases: This is a data service from Amazon, the cheapest of everything they provide. It is very limited in features, so the main use case is when you want to use an Amazon service while paying as little as possible.
Pros: Cheap; all data is stored as text, so it is simple to operate and easy to use.
Cons: Many limitations (document size, collection size, attribute count, attribute size). Storing all data as text also creates additional problems when sorting by date or by number, because it uses lexicographical sorting, which requires a workaround when saving dates or numbers.
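The lexicographical sorting workaround mentioned for SimpleDB usually comes down to storing numbers and dates in a fixed-width form. Here is a minimal sketch of that idea in plain Python. The helper names, the digit width and the offset are assumptions made for illustration; none of them come from an Amazon SDK.

```python
from datetime import datetime, timezone

# Hypothetical helpers, not part of any Amazon SDK.
NUM_WIDTH = 10          # assumed maximum number of digits
OFFSET = 1_000_000      # assumed bound used to shift negatives above zero


def encode_number(value: int) -> str:
    """Shift by OFFSET and zero-pad so string order equals numeric order."""
    shifted = value + OFFSET
    if shifted < 0 or len(str(shifted)) > NUM_WIDTH:
        raise ValueError("value outside the range this encoding supports")
    return str(shifted).zfill(NUM_WIDTH)


def decode_number(text: str) -> int:
    """Reverse encode_number."""
    return int(text) - OFFSET


def encode_date(moment: datetime) -> str:
    """Fixed-width ISO 8601 text in UTC sorts chronologically."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


prices = [9, 100, -5, 42]
naive = sorted(str(p) for p in prices)               # plain text ordering
padded = sorted(encode_number(p) for p in prices)    # padded text ordering
print(naive)
print([decode_number(p) for p in padded])
```

The first list comes out in text order rather than numeric order, while the padded values decode back in numeric order. The cost is that the width and offset have to be chosen up front and every reader has to know about them.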

Cassandra
Cassandra is a perfect solution if writing is your main goal. It is designed to write a lot (in some cases writing can be faster than reading), so it is perfect for logging. It is also very useful for data analysis. Besides that, Cassandra has built-in geographical distribution features.
Strengths: Supported by Apache (good community and high quality), fast writes, no single point of failure. Easy to manage at scale (easy to deploy and to enlarge the cluster).
Weaknesses: The index implementation has problems, querying by index has some limitations, and insert performance decreases if you use indexes. Problems with streaming data transfer.


Reasons for absence of performance test for oss database systems? [closed]

Closed 9 years ago.
I used an open-source Java implementation of the TPC-C benchmark (called TCJ, TPC-C via JDBC, created by MMatejka last year) to compare the performance of Oracle and two OSS DBMSs.
TPC-C is a standard in the proprietary sphere, and my question is:
What are the main reasons that there is no systematically implemented performance test for OSS database systems?
Firstly, I'm not sure your question is a perfect fit for SO, as it is getting close to asking for opinion, so all of my answer is more opinion than fact. Most of this I have read over the years, but I would struggle to find references or proof anymore. I am not a TPC member, but I did heavily investigate getting a distributed column-store database tested under the TPC-H suite.
Benchmarks
These are great at testing a single feature and comparing products on it. Unfortunately, that is nowhere near as easy as it sounds. Companies will spend large amounts of effort to get better results, sometimes (so I have heard) implementing specific functions in the source for a benchmark. There is a lot of discussion about how reliable benchmark results are overall. Also, a benchmark may be a perfect fit for one product but not for another.
Your example uses JDBC, but not every database has JDBC, or worse, it may be a 'minor bolt-on' just to enable that class of application. So performing benchmarks via JDBC when all main usage will be embedded SQL may portray some solutions unfairly or poorly.
There are some arguments that benchmarks distract vendors from real priorities: they spend effort and implement features solely for benchmarks.
Benchmarks can also be very easily misunderstood. Even TPC is a suite of different benchmarks, and you need to select the correct one for your needs (TPC-C for OLTP, TPC-H for DSS, etc.).
TPC
If this reads as negative for TPC, please forgive me; I am pro TPC.
TPC defines a very tight set of test requirements. You must follow these to the letter. For TPC-H, this is an example of what you must do:
do multiple runs, some in parallel, some single user
use exactly the SQL provided; you must not change it at all. If you need to because your system uses a slightly different syntax, you must get a waiver.
you must use an external auditor.
you may not index columns etc. beyond what is specified.
for TPC-H you must do writing in a specified way (which eliminates 'single writer' style databases)
The above ensures that people reading the results can have trust in the integrity of the results, which is great for a corporate buyer.
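To make the "multiple runs, some in parallel, some single user" requirement listed above more concrete, here is a rough sketch of what such a harness could look like. It is not the TPC-H procedure itself. The function names, the stream count and the `fake_execute` stand-in are all assumptions for illustration; a real run would call an actual database driver and follow the audited specification.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def run_stream(execute: Callable[[str], object], queries: Sequence[str]) -> List[float]:
    """Run every query once, in order, and return per-query wall-clock seconds."""
    timings = []
    for sql in queries:
        started = time.perf_counter()
        execute(sql)  # the SQL text is passed through unmodified
        timings.append(time.perf_counter() - started)
    return timings


def run_benchmark(execute: Callable[[str], object],
                  queries: Sequence[str],
                  streams: int = 4) -> Dict[str, object]:
    """Do one single-user pass, then `streams` concurrent passes."""
    single = run_stream(execute, queries)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(run_stream, execute, queries) for _ in range(streams)]
        parallel = [future.result() for future in futures]
    return {"single_user": single, "parallel": parallel}


def fake_execute(sql: str) -> None:
    """Stand-in for a real driver call; it only sleeps briefly."""
    time.sleep(0.001)


report = run_benchmark(fake_execute, ["SELECT 1", "SELECT 2"], streams=3)
print(len(report["single_user"]), len(report["parallel"]))
```

Keeping the query text out of the harness's hands mirrors the "use exactly the SQL provided" rule: the harness only times, it never rewrites.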
TPC is a non-profit org and anybody can join. There is a fee, but it isn't a major barrier, except for OSS. You are only realistically going to pay this fee if you think you can achieve really great results, or you need published results to bid for government contracts etc.
The biggest problem I see with TPC for OSS is that it is heavily skewed towards relational vendors, and very few OSS solutions can meet the entry criteria with their offerings, or if they do, they may not perform well enough for every test. Doing a benchmark may also be a distraction for some teams.
Alternatives to TPC
Of course alternatives to TPC exist, but none has really gained traction yet that I am aware of. Major vendors often stipulate that you cannot benchmark their products and publish the results, so any new benchmark will need to be politically astute to get them on board. I agree with the vendors' stance here; I would hate for someone to mis-implement a benchmark and report my product poorly.
The database landscape has fractured a lot since TPC started, but many 'bet your business' applications still run on 'classic' databases, so they still have a place. However, with the rise of NoSQL etc., there is a place for new benchmarks, but the real question becomes what to measure. Even choosing between xyz like '%kitten%' and xyz like 'kitten%' will have dramatic effects on different solutions. If you solve that, what common interface will you allow (ODBC, JDBC, HTTP/AJAX, embedded SQL, etc.)? Each of these interfaces affects performance greatly. What about the actual models, such as ACID for relational models vs eventual consistency models? What about hardware/software solutions that use specifically designed hardware?
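The difference between the two kitten patterns can be illustrated without any database at all. In the sketch below, a sorted Python list stands in for an ordered index on xyz; that stand-in and the function names are assumptions for illustration, and real engines differ in whether and how they use an index for each pattern.

```python
from bisect import bisect_left
from typing import List

# A sorted list stands in for an ordered (B-tree style) index on column xyz.
index: List[str] = sorted(
    ["kitten", "kitten heels", "kittens", "lost kitten", "puppy", "sad kitten"]
)


def like_prefix(sorted_values: List[str], prefix: str) -> List[str]:
    """xyz LIKE 'prefix%': binary-search the start, then read neighbours."""
    start = bisect_left(sorted_values, prefix)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order lets the scan stop early
        matches.append(value)
    return matches


def like_infix(values: List[str], needle: str) -> List[str]:
    """xyz LIKE '%needle%': order does not help, so every value is inspected."""
    return [value for value in values if needle in value]


print(like_prefix(index, "kitten"))
print(like_infix(index, "kitten"))
```

The prefix search touches only the matching run plus one extra entry, while the infix search has to look at every entry. A benchmark that picks one pattern therefore favours engines with ordered indexes, and one that picks the other favours engines with full-text or scan-optimized designs.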
Each database has made design trade-offs for different needs, and a benchmark is attempting to level the playing field, which is only really possible if you have something in common, or report lots of different metrics.
One of the problems with trying to create an alternative is 'who will pay?'. You need consensus over the type of tests to perform, and then you need to audit the results for them to be meaningful. This all costs money.

What database to choose for massive amount of connections and concurrent writes (online game) [closed]

Closed 10 years ago.
I'm gathering information for an upcoming massive online game. I have experience with MEGA MASSIVE farm-like games (millions of DAU), where SQL databases were a great solution. I also worked on a massive online game where a NoSQL db was used, and that particular db (Mongo) was not the best fit: it did badly with lots of connections and lots of concurrent writes going on.
I'm looking for facts, benchmarks, and presentations about modern massive online games and technical details of their backend infrastructure, databases in particular.
For example I'm interested in:
Can it manage thousands of connections? Maybe some external tool can help (like pgbouncer for Postgres).
Can it manage tens of thousands of concurrent read-writes?
What about disk space fragmentation? Can it be optimized without stopping database?
What about some smart replication? Can it tell that some data is missing from a replica when the master fails? Can I safely promote a slave to master, know exactly what data is missing, and act appropriately?
Can it fail gracefully? (like Postgres, for example)
Good reviews from use in production
Start with the premise that hard crashes are exceedingly rare, and when they occur
it won't be a tragedy if some information is lost.
Use of the database shouldn't be strongly coupled to the routine management of the
game. Routine events ought to be managed through more ephemeral storage. Some
secondary process should organize ephemeral events for eventual storage in a database.
At the extreme, you could imagine there being just one database read and one database
write per character per session.
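A minimal sketch of the "one read and one write per character per session" extreme described above. The `CountingStore` class is a stand-in for the real database and `Session` is a hypothetical in-memory holder; both names, and the `gold`/`xp` fields, are assumptions made for illustration.

```python
from typing import Dict


class CountingStore:
    """Stand-in for the real database; counts round trips."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.reads = 0
        self.writes = 0

    def load(self, character_id: str) -> dict:
        self.reads += 1
        return dict(self.rows.get(character_id, {"gold": 0, "xp": 0}))

    def save(self, character_id: str, state: dict) -> None:
        self.writes += 1
        self.rows[character_id] = dict(state)


class Session:
    """Keeps character state in memory between login and logout."""

    def __init__(self, store: CountingStore, character_id: str) -> None:
        self.store = store
        self.character_id = character_id
        self.state = store.load(character_id)  # the one read

    def apply(self, field: str, delta: int) -> None:
        # Routine game events only touch memory.
        self.state[field] = self.state.get(field, 0) + delta

    def close(self) -> None:
        self.store.save(self.character_id, self.state)  # the one write


store = CountingStore()
session = Session(store, "hero-1")
for _ in range(1000):
    session.apply("xp", 5)
session.apply("gold", 30)
session.close()
print(store.reads, store.writes, store.rows["hero-1"])
```

A thousand in-game events end up as a single database write. The trade-off is the one the answer already accepts: a hard crash before `close()` loses that session's progress, so a real system would probably flush on a timer or on important events as well.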
Have you considered NoSQL?
NoSQL database systems are often highly optimized for retrieval and
appending operations and often offer little functionality beyond
record storage (e.g. key–value stores). The reduced run-time
flexibility compared to full SQL systems is compensated by marked
gains in scalability and performance for certain data models.
In short, NoSQL database management systems are useful when working
with a huge quantity of data when the data's nature does not require a
relational model. The data can be structured, but NoSQL is used when
what really matters is the ability to store and retrieve great
quantities of data, not the relationships between the elements. Usage
examples might be to store millions of key–value pairs in one or a few
associative arrays or to store millions of data records. This
organization is particularly useful for statistical or real-time
analyses of growing lists of elements (such as Twitter posts or the
Internet server logs from a large group of users).
There are higher-level NoSQL solutions, for example CouchDB, which has built-in replication support.

NoSQL worthwhile usage [closed]

Closed 10 years ago.
Recently I encountered the concept of NoSQL, and as far as I can tell it is good for dealing with huge amounts of data.
My question is: what is the limit where using NoSQL becomes worthwhile? Is it only for companies that handle really huge amounts of data, like Google, Facebook, etc., or is it worth the trouble of switching to it from a SQL database even for smaller amounts of data?
I wonder what "concept of NoSQL" you mean, because it is an umbrella term for a wide field of different database technologies. The only thing they have in common is what sets them apart from relational databases: they are "not (only) SQL". They have widely different philosophies, use cases and target groups.
Just to give you an overview, here are a few of the large factions of NoSQL databases.
There are document-based databases like MongoDB or CouchDB. Their advantage is that they do not require a consistent data structure. They are useful when your requirements, and thus your database layout, change constantly, or when you are dealing with datasets which belong together but still look very different. When you have a lot of tables with two columns called "key" and "value", then these might be worth looking into.
There are graph databases like Neo4j or GiraffeDB. Their focus is on defining data by its relation to other data. When you have a lot of tables with primary keys which are the primary keys of two other tables (and maybe some data describing the relation between them), then these might be something for you.
Then you have simple key-value stores like MemcacheDB, Cassandra or Google's BigTable. They are very simplistic, but that makes them fast and easy to use. When you have no need for stored procedures, constraints, triggers and all those advanced database features and you just want fast storage and retrieval of your data, then those are for you.
And these are just a few facets of the new database world.
But there is still one sector where relational databases excel, and that's when it comes to following the ACID principle. Most NoSQL databases don't fully guarantee all four of these:
Atomic transactions (chains of commands which are processed together, in order and all-or-none)
Consistent database schema with constraints and triggers which ensure that garbage data can not exist in the database.
Isolation of transactions - transactions which are guaranteed to be unaffected by others which happen at the same time.
Durability - safety from data-loss even in case of a sudden system crash*
(* to be fair, most of the databases listed above are indeed pretty durable, especially those which are easy to set up as redundant fail-over clusters.)
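As a small illustration of the "all-or-none" part of atomicity and of constraints keeping garbage out, here is a sketch using Python's built-in sqlite3 module. The `account` table and the `transfer` helper are made up for the example; the point is only that a failed step undoes the steps before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 50), ("bob", 10)])
conn.commit()


def transfer(db: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Credit dst and debit src as one unit; return False if rolled back."""
    try:
        with db:  # commits on success, rolls back if an exception escapes
            db.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            db.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to bob succeeds first, then the debit violates the CHECK constraint, and the whole transaction is rolled back, so bob's balance does not keep the 500. With a store that lacks multi-statement transactions, the application would have to detect and repair that half-finished state itself.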

Appengine Datastore Advantages [closed]

Closed 10 years ago.
I have used sql databases a fair bit and can see a lot of benefit in normalised databases that can be joined and searched and relationships built in them.
What are the advantages of the sort of 'object database' that Google has in App Engine's datastore?
GAE's BigTable datastore is not object-oriented or even object-relational. It has more in common with a hash map than with a standard relational database like MySQL or Oracle. The main advantage is scalability and a tighter guarantee on the amount of time a query will take (sort of like CPU time). The scalability comes from the way records are distributed: if you set up your keys correctly, the data associated with those keys will be closer together physically (the data is distributed, so there is no single point of failure).
As with many NoSQL databases, the main advantage of the Datastore is flexibility; nevertheless, the programmer must forget everything about traditional SQL databases.
See this article on techrepublic.com about NoSQL databases.
Data model flexibility. The programmer doesn't have to worry about mapping the object model to a relational model; just put your entities in the Datastore.
Object relationship flexibility. The datastore supports multiple values for a single property, which lets you establish a 1-N relationship just like in object-oriented programming, e.g. by inserting a List as the value of one property.
The rest of the advantages and disadvantages come from the PaaS (Platform as a Service) model, which means you only worry about writing good code while Google takes care of the infrastructure and scalability. See PaaS on Wikipedia.
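The multi-valued property point above can be pictured in plain Python. This is not the App Engine API: the `Author` class, the dict standing in for the datastore, and the `put` and `authors_of` helpers are all hypothetical, chosen only to show how a list-valued property expresses a 1-N relationship without a join table.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Author:
    """Hypothetical entity; `book_titles` is a multi-valued property."""
    key: str
    name: str
    book_titles: List[str] = field(default_factory=list)


# A dict keyed by entity key stands in for the datastore.
datastore: Dict[str, Author] = {}


def put(entity: Author) -> None:
    """Store the entity as-is; no object-relational mapping step."""
    datastore[entity.key] = entity


def authors_of(title: str) -> List[str]:
    """Match an entity if any value of its list property equals `title`."""
    return sorted(a.name for a in datastore.values() if title in a.book_titles)


put(Author("a1", "Ada", ["Notes", "Sketches"]))
put(Author("a2", "Bob", ["Sketches"]))
print(authors_of("Sketches"))
print(authors_of("Notes"))
```

In a relational schema the same data would normally need an author table, a book table and a foreign key or link table between them.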
Technically it's a lot easier to program, since the datastore is bundled with the SDK, and it is easier to share source code and collaborate, since you're getting all components from the same vendor rather than patching together an RDBMS, a scripting engine and hosting.
Economically, the cost-effectiveness of GAE is a huge advantage, since you only pay for what you use. With other services and other hosting you pay like a subscriber, while with the GAE model you pay per quota.
Programming-wise, everything is harder.
The advantages are in scalability, price, and administration. Considering that with many web-apps, programming is easier than administering/scaling/paying for it, GAE/datastore is well worth it.

Comparing Database Platforms [closed]

Closed 9 years ago.
My employer has a database committee and we've been discussing different platforms. Thus far we've had people present on SQLite, MySQL, and PostgreSQL. We currently use the Microsoft stack, so we're all quite familiar with Microsoft SQL Server.
As a part of this comparison I thought it would be interesting to create a small reference application for each database platform to explore the details involved in working with it.
First: does that idea make sense, or does the comparison require going beyond the scope of a trivial sample application?
Second: I would imagine each reference application having a discrete but small set of requirements that fulfill many of the scenarios we run into on a regular basis. Here is what I have so far, what else can be added to the list but still keep the application small enough to be built in a very limited timespan?
Connectivity from the application layer
Tools for database administration
Process of creating a schema (small "s" schema, tables/views/functions other objects)
Simple CRUD (Create, Retrieve, Update, Delete)
Transaction support
Third: has anyone gone through this process, what are your findings?
Does that idea make sense or does the
comparison require going beyond the
scope of a trivial sample application?
I don't think it's a good idea. Most of the things that will really affect you are long term database management issues, and how the database management system you choose can handle those things.
You could be tempted in the short term with things like "I found out in 3 seconds how to do this with XYZ database management system". Now, I'm not saying support is not important; quite the contrary. But finding an answer in google in 3 seconds means that you got an answer to a simple question. How quickly, if ever, can you find an answer to a challenging problem?
A short list (not exhaustive) of important things to consider:
backup and recovery -- at both logical level and physical level
good support for functions (or stored procedures), triggers, various SQL query constructs
APIs that allow real extensibility -- these things can get you out of tough situations and allow you to solve problems in creative ways. You'd be surprised what can be accomplished with user-defined types and functions. How do the user-defined types interact with the indexing system?
SQL standard support -- doesn't trump everything else, but if support is lacking in a few areas, really consider why it is lacking, what the workarounds are, and what are the costs of those workarounds.
A powerful executor that offers a range of fundamental algorithms (e.g. hash join, merge join, etc.) and indexing structures (btree, hash, maybe a full text option, etc.). If it's missing some algorithms or index structures, consider the types of questions that the database will be inefficient at answering. Note: I don't just mean "slow" here; the wrong algorithm can easily be worse by orders of magnitude.
Can the type system reasonably represent your business? If the set of types available is incredibly weak, you will have a mess. Representing everything as strings is kind of like assembly programming (untyped), and you will have a mess.
A trivial application won't show you any of those things. Simple things are simple to solve. If you have a "database committee" then your company cares about its data, and you should take the responsibility seriously. You need to make sure that you can develop applications on it easily with the results you and your developers expect; and when you run into problems you need to have access to a powerful system and quality support that can get you through it.
Actually learning the capabilities of each RDBMS is more crucial, because the choice depends on the application. If you need spatial data capabilities, PostGIS with PostgreSQL is better than MySQL. If you need easy replication and high-availability features, MySQL seems better. There are also license issues. A link for comparison here. All have strengths and weaknesses. First gather the requirements of your project or projects, then compare them with the feature lists of the RDBMSs you pick, and decide which one to go with.
I don't think you need to test the simple CRUD stuff, it's hard to imagine a vendor that doesn't support the basics.
Firstly, you're going beyond the scope of a sample app, in my humble opinion.
Secondly, I'd pick the one most appropriate to the tool or application you wish to develop. For example, are schemas and transactions relevant for a database that stores a single-user app configuration?
Thirdly, I've worked with Access, SQL Server, SQLite, MySQL, PostgreSQL and Oracle, and they all have their place. If you're in the MS space, go with SQL Server (and don't forget Express). There are also ADO.NET ways to talk to the others in my list. It depends on what you want.
Frankly, I doubt an arbitrarily-defined simple application would be likely to really highlight the differences between database engines. I think you'd be better to read the advertising literature for the various engines to see what they claim as their strong points. Then consider which of these matter to you, and construct tests specifically designed to verify claims that you care about.
For example, here are pros and cons of database engines I've used the most that have mattered to me. I don't claim this is an exhaustive list, but it may give you an idea of things to think about:
MySQL: Note: MySQL has two main engines internally: MyISAM and InnoDB. I have never used InnoDB.
Pros: Fast. Free to cheap depending on how you're using it. Very convenient and easy-to-use commands for managing the schema. Some very useful extensions to the SQL standard, like "insert ... on duplicate".
Cons: The MyISAM engine does not support transactions, i.e. there's no rollback. MyISAM engine does not manage foreign keys for you. (InnoDB does not have these drawbacks, but as I say, I've never used it, so I can't comment much further.) Many deviations from SQL standards.
Oracle: Pros: Fast. Generally good conformance to SQL standards. My brother works for Oracle so if you buy there you'll be helping support my family. (Okay, maybe that's not an important pro for you ...)
Cons: Difficult to install and manage. Expensive.
Postgres: Pros: Very high conformance to SQL standards. Free. Very good "explain" plans.
Cons: Relatively slow. Optimizer is easily confused on complex queries. Some awkwardness in modifying existing tables.
Access: Pros: Easy to install and manage. Very easy to use schema management. Built-in data entry tools and query builder for quick-and-dirty stuff. Cheap.
Cons: Slow. Unreliable with multiple users.
I think that you can investigate Firebird too.
This is an extract from Firebird-General on yahoogroups, and I find it quite objective:
Our natural audience is developers who
want to package and sell proprietary
applications. Firebird is easier to
package and install than Postgres;
more capable than SQLite; and doesn't
charge a royalty like MySQL.
