As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Recently I encountered the concept of NoSQL, and as far as I can tell it is good for dealing with huge amounts of data.
My question is: what is the threshold at which using NoSQL becomes worthwhile? Is it only for companies that handle really huge amounts of data, like Google or Facebook, or is it worth the trouble of switching from a SQL database even for smaller amounts of data?
I wonder what "concept of NoSQL" you mean, because it is an umbrella term for a wide field of different database technologies. The only thing they have in common is what sets them apart from each other: they are "not (only) SQL". They have widely different philosophies, use-cases and target groups.
Just to give you an overview, here are a few of the large factions of NoSQL databases.
There are document-based databases like MongoDB or CouchDB. Their advantage is that they do not require a consistent data structure. They are useful when your requirements, and thus your database layout, change constantly, or when you are dealing with datasets which belong together but still look very different. When you have a lot of tables with two columns called "key" and "value", these might be worth looking into.
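The schema-free idea can be illustrated with a few lines of plain Python. This is a toy stand-in, not MongoDB's actual API: two "documents" with different shapes live in the same collection, and queries simply match on whatever fields a document happens to have.

```python
# A toy "collection" of schema-free documents: each record is a dict,
# and records in the same collection need not share the same fields.
articles = [
    {"_id": 1, "title": "Intro", "body": "...", "tags": ["nosql"]},
    {"_id": 2, "title": "Review", "rating": 4, "reviewer": "alice"},  # different shape
]

def find(collection, **criteria):
    """Return documents whose fields match all given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(articles, reviewer="alice"))  # only the second document matches
```

A relational schema would force both shapes into one table (with many NULL columns) or into the key/value pattern mentioned above.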
There are graph databases like Neo4j or GiraffeDB. Their focus is on defining data by its relation to other data. When you have a lot of tables whose primary keys are the primary keys of two other tables (and maybe some data describing the relation between them), these might be something for you.
Then you have simple key-value stores like MemcacheDB, Cassandra or Google's BigTable. They are very simplistic, but that makes them fast and easy to use. When you have no need for stored procedures, constraints, triggers and all those advanced database features and you just want fast storage and retrieval of your data, then those are for you.
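The entire interface of such a store is essentially get/set on a key. A minimal in-memory sketch in Python (an illustration of the access pattern, not Memcached's real protocol), with the optional expiry that cache-style stores usually offer:

```python
import time

class KVStore:
    """A toy in-memory key-value store with optional expiry,
    illustrating the minimal get/set interface of stores like Memcached."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl=None):
        # ttl is seconds until the entry expires; None means "keep forever".
        expires = time.monotonic() + ttl if ttl else None
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        value, expires = self._data.get(key, (default, None))
        if expires is not None and time.monotonic() > expires:
            del self._data[key]  # lazily evict the stale entry
            return default
        return value

cache = KVStore()
cache.set("user:42:name", "Alice")
print(cache.get("user:42:name"))  # Alice
```

Notice there are no schemas, joins, or constraints anywhere, which is exactly why such stores can be so fast.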
And these are just a few facets of the new database world.
But there is still one sector where relational databases excel, and that's when it comes to following the ACID principle. Most NoSQL databases don't fully guarantee all four of these:
Atomic transactions (chains of commands which are processed together, in order, and all-or-none)
Consistent database schema with constraints and triggers which ensure that garbage data cannot exist in the database.
Isolation of transactions - transactions which are guaranteed to be unaffected by others which happen at the same time.
Durability - safety from data loss even in case of a sudden system crash*
(* To be fair, most of the databases listed above are indeed pretty durable, especially those which are easy to set up as redundant fail-over clusters.)
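Atomicity in particular is easy to demonstrate with a relational engine. A minimal sketch using Python's built-in SQLite: a transfer either applies both updates or neither, because the failing CHECK constraint rolls back the whole transaction.

```python
import sqlite3

# Demonstrating atomicity: either both rows of a transfer are applied, or neither.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; rolls back automatically on exception
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        # alice only has 100, so this violates the CHECK constraint:
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # the whole transfer was rolled back, including bob's credit

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # balances unchanged
```

Most NoSQL stores would happily leave bob 200 richer here; that is the trade-off the answer is pointing at.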
Closed 9 years ago.
We're (re)designing a corporate information system. For the database design, we are exploring these options:
[Option 1->] a single CompanyBigDatabase that has everything,
[Option 2->] several databases across the company (say, HRD_DB, FinanceDB, MarketingDB), which are then synchronized through an application layer. EmployeeTable is owned by HRD; if Finance wants to refer to employees, it queries EmployeeTable from HRD_DB via a web service.
What is the best practice? What are the pros and cons? We want it to have high availability and to be reasonably reliable. Does Option 1 necessitate clustering and the like for this? Do big companies and universities (like Toyota, Samsung, Stanford, MIT, ...) always opt for Option 1?
I was looking in many DB textbooks but I could not find a sufficient explanation on this topic.
Any thoughts, tips, links, or advice is welcome. Thanks.
I have done this type of work for 20 years; "enterprise architecture" is one term used to describe it. If you are asking this question in a real enterprise scenario, I am going to recommend you get advice. If it's a university question: there are so many things to consider:
budget
politics
timeframes
legacy systems or green field
scope of build
in-house or hosted
complete outsourcing of some or all of the functionality (SaaS)
....
Entire Methodologies are written to support projects that do this.
You can end up with many answers to the variables.
Even agreeing on how to weight features and outcomes is tough.
This is a HUGE question you could write a book on.
It's like a two-paragraph question where I have seen 10 people spend a month putting a business case together to do X. That's just costing and planning the various options, without even selecting the final approach.
So I have not directly answered your question... that, my friend, is a serious research project, not really a StackOverflow question.
There is no single answer. It depends on many other factors such as database load, application architecture, scalability, etc. My suggestion: start the simplest way possible (a single database) and change it based on your needs.
A single database has its advantages: simpler joins, referential integrity, a single backup. Only separate pieces of data when you have a valid reason or need to.
In my opinion, it would be more appropriate to have the database normalized and to have several databases across the company, split by department. This would allow you to manage data more effectively in terms of storing, retrieving, and updating information, and in providing access to users based on department or user type. You can also provide different views of the database. It will be a lot easier to manage the data.
There is a general principle of databases in particular, and computing in general, that there should be a single authoritative source for every data item.
Extending this to sets of data, as soon as you have multiple customer lists, multiple lists of items, multiple email addresses, you are soon into a quagmire of uncertainty that will then call for a business intelligence solution to resolve them all.
Now I'm a business intelligence chap by historical leaning, but I'd be the first to say that this is not a path that you want to go down simply because Marketing and Accounts cannot decide the definition of "customer". You do it because your well-normalised OLTP systems do not make it easy to count how many customers there were yesterday, last week, and last year.
Nor should they either, because then they would be in danger of sacrificing their true purpose -- to maintain a high performance, high-integrity persistent store of the "data universe" that your company exists in.
So in other words, the single-database approach has data integrity on its side, and you really do not want to work in a company that does not have data integrity. As a business intelligence practitioner I can tell you that it is a horrible place.
On the other hand, you are going to have practical situations in which you simply must have separate systems due to application vendor constraints etc, and in that case the problem becomes one of keeping the data as tightly coupled as possible, and of Metadata Management (ugh) in which the company agrees what data in what location has what meaning.
Either will work, and other decisions will mostly affect your specification. To some extent your question could be described as "Should I go down the ERP path or the SaaS path?" I think it is telling that right now most systems are tending towards SaaS.
How will you be managing the applications? If they will be updated at different times, separate DBs make more sense (the SaaS path). On the other hand, having one DB to connect to, one authorisation system, one place to look for details, one place to back up, etc., appears to decrease complexity in the technical space. But it then does not allow decisions affecting one part of the business to be considered separately from other parts of the business.
Once the business is involved, trying to find a single time at which every department agrees to an upgrade can be hell. Having a degree of abstraction, so that you only have to get one department to align before updating its part of the stack, has real advantages in the coming years. And if your web services are robust and don't change with each release, this can be a much easier path.
Don't forget you can have views of data in other DBs.
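A concrete sketch of that idea, using SQLite from Python (other engines have analogous cross-database features, e.g. linked servers or foreign data wrappers; the database names here are hypothetical stand-ins for the question's HRD_DB/FinanceDB):

```python
import sqlite3

conn = sqlite3.connect(":memory:")                 # the "Finance" database
conn.execute("ATTACH DATABASE ':memory:' AS hrd")  # stands in for HRD_DB
conn.execute("CREATE TABLE hrd.employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO hrd.employees VALUES (1, 'Alice'), (2, 'Bob')")

# Finance reads employees through a view; HRD remains the owner of the data.
# (SQLite requires a TEMP view to reference an attached database.)
conn.execute("CREATE TEMP VIEW finance_employees AS SELECT id, name FROM hrd.employees")
print(conn.execute("SELECT name FROM finance_employees ORDER BY id").fetchall())
```

The view gives the consuming department read access without duplicating the data, which is the "one place to CRUD, many places to read" pattern described below.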
And as for your question of how most big companies work: generally by a mish-mash of big and little systems that sometimes talk to each other, sometimes don't, and often repeat data. Having said that, repeating data is a real problem; always have an authoritative source and copies (or, even better, only one copy). A method I have seen work well in a number of enterprises is to have one place where details can be CRUDed (created, retrieved, updated and deleted) and numerous places where they can be read.
And really this design decision has little or nothing to do with availability and reliability. Those tend to come from good design (simplicity, knowing where things live, etc.), good practices (good release practices, admin practices, backups, intelligent redundancy, etc.) and spending money, not from having one or multiple systems.
Closed 10 years ago.
I am really interested in non-relational databases, but for many reasons I am familiar with only a small part of the field. So I want to list the NoSQL technologies you use, with basic use cases, pros and cons.
If you had specific issues while working with some technology, interesting experiences, etc., you are welcome to share them with the community.
Personally I worked with:
MongoDB:
Use cases: In my opinion it is one of the best choices if you need good aggregation features and automatic replication. It scales well and has many features which allow using it as an everyday database, so if for some reason you don't want a SQL solution, Mongo can be a great choice. Mongo is also great if you need dynamic queries, and it supports indexing, which is an important feature.
Pros: Fast, scales well, easy to use, built-in geospatial indexes.
Cons: Comparatively slow write operations; blocking atomic operations can cause a lot of problems; a memory-hungry process can "eat" all available memory.
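"Dynamic queries" means the query itself is just data you can assemble at run time, rather than a string of SQL. A toy evaluator for Mongo-style query documents in plain Python (an illustration of the concept, not MongoDB's real implementation):

```python
def matches(doc, query):
    """Check a document against a Mongo-style query dict, e.g. {"age": {"$gt": 18}}."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 5} or {"$in": [...]}
            for op, operand in cond.items():
                if op == "$gt" and not (value is not None and value > operand):
                    return False
                if op == "$in" and value not in operand:
                    return False
        elif value != cond:  # plain equality match
            return False
    return True

users = [{"name": "a", "age": 30}, {"name": "b", "age": 17}]
adults = [u for u in users if matches(u, {"age": {"$gt": 18}})]
print(adults)  # [{'name': 'a', 'age': 30}]
```

Because the query is an ordinary dict, application code can build it up conditionally (add a filter here, drop one there) without any string concatenation.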
CouchDB:
Use cases: I use it in a wiki-like project and I think for such cases it is the perfect database. The fact that each document is automatically saved as a new revision on update helps to see all the changes. It is good for accumulating, occasionally changing data on which pre-defined queries are to be run.
Pros: Easy to use, REST-oriented interface, versioning.
Cons: Performance problems once the number of documents gets quite large (more than half a million); rather poor query features (can be remedied by adding Lucene).
SimpleDB:
Use cases: This is a data service from Amazon, the cheapest of everything they provide. It is very limited in features, so the main use case is when you want to use Amazon services while paying as little as possible.
Pros: Cheap; all data is stored as text, which is simple to operate on; easy to use.
Cons: Many limitations (document size, collection size, attribute count, attribute size). The fact that all data is stored as text also creates problems when sorting by date or by number (because it uses lexicographic sorting, which needs a workaround when saving dates or numbers).
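The standard workaround for lexicographic-only sorting is to store numbers zero-padded to a fixed width and dates in ISO 8601, so that string order coincides with numeric or chronological order. A quick Python demonstration of the problem and the fix:

```python
def encode_number(n, width=10):
    """Zero-pad a non-negative integer so lexicographic order == numeric order."""
    return str(n).zfill(width)

prices = [5, 120, 19]

# Plain text sorts in the wrong order: '120' < '19' < '5'.
print(sorted(map(str, prices)))          # ['120', '19', '5']

# Zero-padded text sorts correctly.
print(sorted(encode_number(p) for p in prices))
# ['0000000005', '0000000019', '0000000120']

# ISO 8601 timestamps (e.g. "2024-01-31T09:00:00") already sort correctly as text.
```

The same trick applies to any store that only compares byte strings; negative numbers need an additional offset or sign-flip encoding, which is omitted here.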
Cassandra
Cassandra is a perfect solution if writing is your main goal; it is designed to write a lot (in some cases writes can be faster than reads), so it is perfect for logging. It is also very useful for data analysis. Beyond that, Cassandra has built-in geographical distribution features.
Strengths: Supported by Apache (good community and high quality), fast writes, no single point of failure. Easy to manage at scale (easy to deploy and enlarge a cluster).
Weaknesses: The index implementation has problems, querying by index has some limitations, and if you use indexes, insert performance decreases. Problems with streaming data transfer.
Closed 10 years ago.
I'm gathering information for an upcoming massive online game. I have experience with MEGA MASSIVE farm-like games (millions of DAU), where SQL databases were a great solution. I also worked on a massive online game where a NoSQL DB was used, and that particular DB (Mongo) was not the best fit: it behaved badly with a lot of connections and a lot of concurrent writes going on.
I'm looking for facts, benchmarks, presentation about modern massive online games and technical details about their backend infrastructure, databases in particular.
For example I'm interested in:
Can it manage thousands of connections? Maybe some external tool can help (like pgbouncer for Postgres).
Can it manage tens of thousands of concurrent read-writes?
What about disk-space fragmentation? Can it be optimized without stopping the database?
What about smart replication? Can it tell that some data is missing from a replica when the master fails? Can I safely promote a slave to master, know exactly what data is missing, and act appropriately?
Can it fail gracefully? (like Postgres, for example)
Good reviews from production use
Start with the premise that hard crashes are exceedingly rare, and that when they occur it won't be a tragedy if some information is lost.
Use of the database shouldn't be strongly coupled to the routine management of the game. Routine events ought to be managed through more ephemeral storage, and some secondary process should organize ephemeral events for eventual storage in a database. At the extreme, you could imagine there being just one database read and one database write per character per session.
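That pattern can be sketched in a few lines of Python. Here `persist` is a hypothetical hook standing in for the real database write; the point is that many in-memory events collapse into a single write at session end:

```python
class Session:
    """Buffer routine game events in memory; write to the DB once, at session end."""

    def __init__(self, character, persist):
        self.character = character
        self.events = []        # ephemeral, lost on crash -- acceptable per the premise
        self.persist = persist  # hypothetical DB-write hook, called once per session

    def record(self, event):
        self.events.append(event)

    def end(self):
        # Collapse many events into one summary row -> one database write.
        gold = sum(e.get("gold", 0) for e in self.events)
        self.persist({"character": self.character, "gold_earned": gold})

writes = []  # stands in for the database
s = Session("arthas", writes.append)
s.record({"gold": 5})
s.record({"gold": 7})
s.end()
print(writes)  # [{'character': 'arthas', 'gold_earned': 12}]
```

A real implementation would also flush periodically (not only at session end) to bound the loss window, but the load profile the database sees stays the same: rare, batched writes.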
Have you considered NoSQL ?
NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key–value stores). The reduced run-time flexibility compared to full SQL systems is compensated by marked gains in scalability and performance for certain data models.
In short, NoSQL database management systems are useful when working with a huge quantity of data when the data's nature does not require a relational model. The data can be structured, but NoSQL is used when what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements. Usage examples might be to store millions of key–value pairs in one or a few associative arrays or to store millions of data records. This organization is particularly useful for statistical or real-time analyses of growing lists of elements (such as Twitter posts or the Internet server logs from a large group of users).
There are higher-level NoSQL solutions, for example CouchDB, which has built-in replication support.
Closed 10 years ago.
I have used SQL databases a fair bit and can see a lot of benefit in normalised databases that can be joined and searched, with relationships built into them.
What are the advantages of the sort of 'object database' that Google has in App Engine's datastore?
GAE's BigTable datastore is not object-oriented or even object-relational. It has more in common with a hashmap than with a standard relational database like MySQL or Oracle. The main advantages are scalability and a tighter guarantee on the amount of time a query will take (sort of like CPU time). The scalability comes from the way records are distributed: if you set up your keys correctly, then the data associated with those keys will be physically closer together (and since the data is distributed, there is no single point of failure).
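The key-locality point can be illustrated with a toy sketch of key-ordered storage (plain Python, not BigTable itself): rows whose keys share a prefix sit next to each other, so fetching one user's records is a contiguous slice rather than a full-table search.

```python
import bisect

# Keys are kept sorted, as in a key-ordered store like BigTable.
rows = sorted([
    ("user1:msg:001", "hi"),
    ("user1:msg:002", "bye"),
    ("user2:msg:001", "hello"),
])

def scan_prefix(rows, prefix):
    """Return the contiguous run of rows whose keys start with `prefix`."""
    keys = [k for k, _ in rows]
    start = bisect.bisect_left(keys, prefix)
    end = bisect.bisect_left(keys, prefix + "\xff")  # just past the prefix range
    return rows[start:end]

print(scan_prefix(rows, "user1:"))  # both of user1's messages, no full scan
```

This is why key design matters so much in the datastore: a well-chosen key prefix turns a query into a cheap range scan.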
As with many NoSQL databases, the main advantage of the Datastore is flexibility; nevertheless, the programmer must forget everything about traditional SQL databases.
See this article on techrepublic.com about NoSQL databases.
Data-model flexibility. The programmer doesn't have to worry about mapping the object model to a relational model; just put your entities in the Datastore.
Object-relationship flexibility. The Datastore supports multiple values for one single property, which lets you establish a 1-N relationship just like in object-oriented programming, e.g. by inserting a list as the value of one property.
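A minimal sketch of such a multi-valued property (a plain-Python stand-in, not the actual GAE datastore API): the N side of the relationship lives directly on the entity as a list, instead of in a join table.

```python
# One property, many values -- no join table needed for the 1-N relationship.
posts = [
    {"title": "Hello", "tags": ["nosql", "gae", "datastore"]},
    {"title": "Other", "tags": ["sql"]},
]

# "Which posts carry tag X?" becomes a membership test per entity:
print([p["title"] for p in posts if "nosql" in p["tags"]])  # ['Hello']
```

In the real datastore, list properties are indexed per element, so this membership test is an index lookup rather than a scan.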
The rest of the advantages/disadvantages come from the PaaS (Platform as a Service) model, which means you only worry about writing good code while Google takes care of the infrastructure and scalability. See PaaS on Wikipedia.
Technically it's a lot easier to program, since the datastore is bundled with the SDK, and easier to share source code and collaborate, since you're getting all components from the same vendor rather than patching together an RDBMS, a scripting engine and hosting.
Economically, the cost-effectiveness of GAE is a huge advantage, since you only pay for what you use. With other services and other hosting you pay like a subscriber, while with GAE's model you pay per quota.
Programming-wise, everything is harder.
The advantages are in scalability, price, and administration. Considering that with many web-apps, programming is easier than administering/scaling/paying for it, GAE/datastore is well worth it.
Closed 11 years ago.
I've found databases typically come in two flavors: your traditional row-oriented RDBMS, or an object-oriented database (OODBMS). However, I remember a new breed of column-oriented databases showing up in the mid-90s. Some of these were given the term 4GL, but I don't think the term stuck.
What I'd like to know is the following:
What column oriented databases still exist?
What are the performance characteristics of these databases?
Are there any open source column oriented databases?
What platforms do they interoperate with (.NET, Java, etc.)?
What's been your general experience with them?
The two column oriented databases that I remember working with are FAME and KDB.
HBase is an open-source column-oriented database system modelled on Google's BigTable.
Infobright
It's a column-oriented MySQL engine.
You can use (almost) all MySQL APIs/interfaces/tools, but it's column-oriented.
It's open source and has a free version.
It's very good for warehousing. I had a 10 GB fact table in SQL Server;
Infobright compressed it to 15 MB.
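Ratios like that are plausible because a single column is homogeneous and highly repetitive, so even generic compression does far better on it than on row-mixed data. A toy demo with zlib (illustration only, not Infobright's actual codec):

```python
import random
import zlib

# A column of 100,000 country codes drawn from only three distinct values --
# the kind of low-cardinality column that dominates warehouse fact tables.
random.seed(0)
countries = [random.choice(["US", "DE", "JP"]) for _ in range(100_000)]
column_bytes = ",".join(countries).encode()

compressed = zlib.compress(column_bytes)
print(len(column_bytes), "->", len(compressed))  # a large reduction
```

In a row store the same values would be interleaved with names, amounts, and dates, breaking up the repetition and hurting the compression ratio.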
Also check out Michael Stonebraker's C-store:
C-store (includes links to source code and research paper)
The paper contains an excellent viewpoint on column oriented databases, that should answer most of your questions.
To quote the paper,
"Most major DBMS vendors implement record-oriented storage systems, where the attributes of a record (or tuple) are placed contiguously in storage. With this row store architecture, a single disk write suffices to push all of the fields of a single record out to disk. Hence, high performance writes are achieved, and we call a DBMS with a row store architecture a write-optimized system.
In contrast, systems oriented toward ad-hoc querying of large amounts of data should be read-optimized. Data warehouses represent one class of read-optimized system, in which periodically a bulk load of new data is performed, followed by a relatively long period of ad-hoc queries. Other read-mostly applications include customer relationship management (CRM) systems, electronic library card catalogs, and other ad-hoc inquiry systems. In such environments, a column store architecture, in which the values for each single column (or attribute) are stored contiguously, should be more efficient. This efficiency has been demonstrated in the warehouse marketplace by products like Sybase IQ [FREN95, SYBA04], Addamark [ADDA04], and KDB [KDB04]. In this paper, we discuss the design of a column store called C-Store that includes a number of novel features relative to existing systems."
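The row-versus-column contrast in the quoted passage can be made concrete with a toy layout in Python: an analytic query over one attribute walks every whole tuple in the row layout, but only one contiguous list in the column layout.

```python
# Row store: each record's attributes are stored contiguously.
row_store = [
    (1, "alice", 100),
    (2, "bob",   250),
    (3, "carol",  75),
]

# Column store: each attribute's values are stored contiguously.
column_store = {
    "id":      [1, 2, 3],
    "name":    ["alice", "bob", "carol"],
    "balance": [100, 250, 75],
}

# Analytic query: total balance.
total_rows = sum(r[2] for r in row_store)  # touches every full tuple
total_cols = sum(column_store["balance"])  # reads one contiguous column
print(total_rows, total_cols)  # 425 425
```

The answers are identical; the difference is how much data has to be read to produce them, which is exactly the read-optimized argument the paper makes. Conversely, inserting one new record means one append in the row layout but three separate appends in the column layout, which is why writes suffer.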
Sybase IQ is one I have heard of.
InfiniDB was recently released open source (GPLv2) by Calpont. It supports most of the MySQL API and stores data in a column-oriented fashion, and is optimized for large-scale analytic processing.
Here are the column-oriented DBMS implementations Wikipedia lists:
Column-Oriented DBMS Implementations
Sybase IQ is column-oriented. All columns are automatically indexed when you create a table, and data is nicely compressed in the columns.
It's a nice OLAP database (...data warehouse), but I would not recommend it for any kind of transaction processing, as it is designed for data-warehouse operations.
As for performance characteristics, SELECTs are very fast for large volumes of data, but INSERT/UPDATE/DELETEs are very slow compared to a standard OLTP DB such as Sybase ASE, for example. Table locking is also very different from an OLTP database, so expect exclusive table locks for write operations (INSERTs etc.) when working in the MAIN data store.
Otherwise it supports T-SQL (Sybase version) and Watcom SQL.
Cheers,
Kevin