I have used SQL databases a fair bit and can see a lot of benefit in normalised databases that can be joined and searched, with relationships built between tables.
What are the advantages of the sort of 'object database' that Google offers in App Engine's datastore?
GAE's BigTable datastore is not object-oriented or even object-relational. It has more in common with a hash map than with a standard relational database like MySQL or Oracle. The main advantages are scalability and a tighter guarantee on the amount of time a query will take (somewhat like CPU time). The scalability comes from the way records are distributed: if you set up your keys correctly, the data associated with those keys will be physically close together, and the data is distributed so there is no single point of failure.
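For instance, because BigTable keeps rows sorted by key, entities whose keys share a prefix tend to land near each other. Here is a minimal sketch with the legacy App Engine Python db API; the key-naming scheme is just an illustration, not a prescribed convention:

    from google.appengine.ext import db

    class Message(db.Model):
        body = db.TextProperty()

    # Keys sharing the 'board1:' prefix sort together, so these rows
    # tend to end up physically close in the underlying BigTable.
    for i in range(3):
        Message(key_name='board1:msg%05d' % i, body='hello').put()

    # A key-range scan over that prefix is then a cheap, contiguous read.
    start = db.Key.from_path('Message', 'board1:')
    end = db.Key.from_path('Message', 'board1;')  # ';' sorts just after ':'
    nearby = Message.all().filter('__key__ >=', start).filter('__key__ <', end)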
As with many NoSQL databases, the main advantage of the Datastore is flexibility; the trade-off is that the programmer must forget everything about traditional SQL databases.
See this article on techrepublic.com about NoSQL databases.
Data model flexibility. The programmer doesn't have to worry about mapping the object model to a relational model; just put your entities in the Datastore.
Object relationship flexibility. The Datastore supports multiple values for one single property, which lets you establish a 1-N relationship just as in object-oriented programming, e.g. by inserting a List as the value of one property (see the sketch below).
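A minimal sketch of such a multi-valued property with the legacy Python db API; the model and field names are made up:

    from google.appengine.ext import db

    class Recipe(db.Model):
        name = db.StringProperty()
        tags = db.StringListProperty()             # one property, many values
        ingredient_keys = db.ListProperty(db.Key)  # a 1-N relationship via keys

    recipe = Recipe(name='Pancakes', tags=['breakfast', 'easy'])
    recipe.put()

    # Queries on a list property match against any element of the list:
    breakfast = Recipe.all().filter('tags =', 'breakfast').fetch(10)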
The rest of the advantages/disadvantages come with the PaaS (Platform as a Service) model, which means you only worry about writing good code while Google takes care of the infrastructure and scalability. See PaaS on Wikipedia.
Technically it's a lot easier to program, since the datastore is bundled with the SDK, and easier to share source code and collaborate, since you're getting all components from the same vendor rather than patching together an RDBMS, a scripting engine, and hosting.
Economically, GAE's cost-effectiveness is a huge advantage, since you only pay for what you use. With other services and other hosting you pay a flat subscription, while with GAE's model you pay per quota.
Programming-wise, everything is harder.
The advantages are in scalability, price, and administration. Considering that with many web-apps, programming is easier than administering/scaling/paying for it, GAE/datastore is well worth it.
We're (re)designing a corporate information system. For the database design, we are exploring these options:
[Option 1->] a single CompanyBigDatabase that has everything,
[Option 2->] several databases across the company (say, HRD_DB, FinanceDB, MarketingDB), which are then synchronized through an application layer. EmployeeTable is owned by HR; if Finance wants to refer to employees, it queries EmployeeTable from HRD_DB via a web service (sketched below).
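To make Option 2 concrete, here is a minimal sketch of what such a service call might look like; the endpoint URL and response fields are hypothetical:

    import requests

    def get_employee(employee_id):
        # Finance never reads HRD_DB directly; it asks the HR service,
        # which owns EmployeeTable. (Hypothetical internal endpoint.)
        resp = requests.get(
            'https://hr.example.internal/employees/%d' % employee_id,
            timeout=5)
        resp.raise_for_status()
        return resp.json()  # e.g. {'id': 42, 'name': '...', 'department': '...'}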
What is the best practice? What are the pros and cons? We want high availability and reasonable reliability. Does Option 1 necessitate clustering and the like to achieve this? Do big companies and universities (like Toyota, Samsung, Stanford, MIT, ...) always opt for Option 1?
I have looked in many DB textbooks but could not find a sufficient explanation of this topic.
Any thoughts, tips, links, or advice is welcome. Thanks.
I've done this type of work for 20 years; enterprise architecture is one term used to describe it. If you are asking this question in a real enterprise scenario, I'm going to recommend you get expert advice. Whether it's a real project or a uni question, there are many things to consider:
budget
politics
timeframes
legacy systems or green field
scope of build
in-house or hosted
complete outsourcing of some or all of the functionality (SaaS)
....
Entire methodologies are written to support projects that do this.
You can end up with many answers, depending on the variables.
Even agreeing on how to weight features and outcomes is tough.
This is a HUGE question you could write a book on.
It's the kind of two-paragraph question where I have seen ten people spend a month putting a business case together to do X. That's just costing and planning the various options, without even selecting the final approach.
So I have not directly answered your question... that, my friend, is a serious research project, not really a StackOverflow question.
There is no single answer; it depends on many other factors, such as database load, application architecture, scalability, etc. My suggestion: start the simplest way possible (a single database) and change it based on need.
A single database has its advantages: simpler joins, referential integrity, a single backup. Only separate pieces of data when you have a valid reason or need to.
In my opinion, it would be more appropriate to normalize the database and have several databases across the company, split by department. This would allow you to manage data more effectively in terms of storing, retrieving, and updating information, and to grant access based on department or user type. You can also provide different views of the database. It will be a lot easier to manage the data.
There is a general principle of databases in particular, and computing in general, that there should be a single authoritative source for every data item.
Extending this to sets of data: as soon as you have multiple customer lists, multiple lists of items, and multiple email addresses, you are soon in a quagmire of uncertainty that will then call for a business intelligence solution to resolve them all.
Now I'm a business intelligence chap by historical leaning, but I'd be the first to say that this is not a path that you want to go down simply because Marketing and Accounts cannot decide the definition of "customer". You do it because your well-normalised OLTP systems do not make it easy to count how many customers there were yesterday, last week, and last year.
Nor should they either, because then they would be in danger of sacrificing their true purpose -- to maintain a high performance, high-integrity persistent store of the "data universe" that your company exists in.
So in other words, the single-database approach has data integrity on its side, and you really do not want to work in a company that lacks data integrity. As a business intelligence practitioner, I can tell you that it is a horrible place to be.
On the other hand, you are going to have practical situations in which you simply must have separate systems, due to application vendor constraints and the like. In that case the problem becomes one of keeping the data as tightly coupled as possible, and of metadata management (ugh), in which the company agrees on what data in what location has what meaning.
Either will work, and other decisions will mostly affect your specification. To some extent your question could be described as "Should I go down the ERP path or the SaaS path?" I think it is telling that right now most systems are tending towards SaaS.
How will you be managing the applications? If they will be updated at different times, separate DBs make more sense (the SaaS path). On the other hand, having one DB to connect to, one authorisation system, one place to look for details, one place to back up, and so on appears to decrease complexity in the technical space. But it then does not allow decisions affecting one part of the business to be considered separately from other parts of the business.
Once the business is involved, trying to find a single time at which every department agrees to an upgrade can be hell. Having a degree of abstraction, so that you only have to get one department to align before updating its part of the stack, has real advantages in the coming years. And if your web services are robust and don't change with each release, this can be a much easier path.
Don't forget you can have views of data in other DBs.
And as for your question of how most big companies work: generally by a mish-mash of big and little systems that sometimes talk to each other, sometimes don't, and often repeat data. Having said that, repeating data is a real problem; always have an authoritative source and copies (or, even better, only one copy). A method I have seen work well in a number of enterprises is to have one place where details can be CRUDed (created, retrieved, updated, and deleted) and numerous places where they can be read.
And really, this design decision has little or nothing to do with availability and reliability. Those tend to come from good design (simplicity, knowing where things live, etc.), good practices (good release practices, admin practices, backups, intelligent redundancy, etc.), and spending money, not from having one or multiple systems.
I'm really interested in non-relational databases, but for various reasons I am familiar with only a small part of the field. So I want to list all the NoSQL technologies you use, with basic use cases, pros, and cons.
If you have had specific issues while working with particular technologies, interesting experiences, etc., you are welcome to share them with the community.
Personally, I have worked with:
MongoDB:
Use cases: In my opinion it is one of the best if you need good aggregation features and automatic replication. It scales well and has many features that let you use it as an everyday database; if for some reason you don't want an SQL solution, Mongo could be a great choice. Mongo is also great if you need dynamic queries, and it supports indexing, which is another important feature.
Pros: Fast, scales well, easy to use, built-in geospatial indexes.
Cons: Comparatively slow writes; blocking atomic operations can cause a lot of problems; the memory-hungry process can "eat" all available memory.
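To illustrate the dynamic queries, aggregation, and indexing mentioned above, a minimal pymongo sketch; the collection and field names are made up:

    from pymongo import MongoClient

    db = MongoClient()['shop']  # assumes a local mongod on the default port
    db.orders.insert_one({'user': 'alice', 'total': 30, 'items': ['tea']})
    db.orders.create_index('user')  # secondary index on a field

    # Dynamic query: the filter document is built at runtime.
    expensive = db.orders.find({'total': {'$gt': 25}})

    # Aggregation: total spent per user.
    per_user = db.orders.aggregate([
        {'$group': {'_id': '$user', 'spent': {'$sum': '$total'}}},
    ])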
CouchDB:
Use cases: I used it in a wiki-like project, and I think for such cases it is the perfect database. The fact that each document is automatically saved as a new revision on update helps you see all the changes. It is good for accumulating, occasionally changing data on which pre-defined queries are to be run.
Pros: Easy to use, REST-oriented interface, revisions.
Cons: Performance problems when the number of documents gets quite large (more than half a million), and rather poor query features (which can be solved by adding Lucene).
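A quick illustration of the REST-oriented interface and the automatic revisions, using plain HTTP via requests; the database and document are made up, and a local CouchDB on the default port is assumed:

    import requests

    base = 'http://localhost:5984/wiki'
    requests.put(base)  # create the database

    # Create a document, then update it; each update must cite the
    # current _rev, and CouchDB keeps a new revision per change.
    requests.put(base + '/page1', json={'text': 'first draft'})
    doc = requests.get(base + '/page1').json()
    requests.put(base + '/page1', json={'text': 'edited', '_rev': doc['_rev']})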
SimpleDB:
Use cases: This is a data service from Amazon, the cheapest of all the storage services they provide. It is very limited in features, so the main use case is when you want to use Amazon services while paying as little as possible.
Pros: Cheap; all data is stored as text, which is simple to operate on; easy to use.
Cons: Many limitations (document size, collection size, attribute count, attribute size). Storing everything as text also creates problems when sorting by date or by number, because lexicographic sorting needs workarounds when saving dates or numbers (see the sketch below).
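The usual workaround is to store numbers zero-padded and dates in ISO 8601 format, so that lexicographic order matches natural order. A small sketch:

    import datetime

    def encode_number(n, width=10):
        # '42' > '100' as text, but '0000000042' < '0000000100'.
        return str(n).zfill(width)

    def encode_date(dt):
        # ISO 8601 strings sort chronologically.
        return dt.isoformat()

    assert encode_number(42) < encode_number(100)
    assert (encode_date(datetime.datetime(2011, 3, 7))
            < encode_date(datetime.datetime(2011, 12, 1)))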
Cassandra:
Cassandra is a perfect solution if writing is your main goal; it is designed to handle heavy writes (in some cases writing can be faster than reading), so it is perfect for logging. It is also very useful for data analysis, and it has built-in geographical distribution features.
Strengths: Backed by Apache (good community and high quality); fast writes; no single point of failure; easy to manage at scale (easy to deploy and enlarge a cluster).
Weaknesses: The index implementation has problems, querying by index has some limitations, and insert performance decreases if you use indexes. There are also problems with streaming data transfer.
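A minimal sketch of the write-heavy logging use case with the DataStax Python driver; the keyspace and schema are made up, and a local node is assumed:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('logs_ks')
    session.execute("""
        CREATE TABLE IF NOT EXISTS events (
            source text, ts timeuuid, message text,
            PRIMARY KEY (source, ts))
    """)
    # Writes like this are cheap in Cassandra, which is why it suits logging.
    session.execute(
        "INSERT INTO events (source, ts, message) VALUES (%s, now(), %s)",
        ('web-1', 'user logged in'))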
I'm gathering information for an upcoming massive online game. I have experience with MEGA MASSIVE farm-like games (millions of DAU), and SQL databases were a great solution there. I have also worked on a massive online game where a NoSQL DB was used, and that particular DB (Mongo) was not the best fit: it struggled when lots of connections and lots of concurrent writes were going on.
I'm looking for facts, benchmarks, and presentations about modern massive online games, with technical details about their backend infrastructure, databases in particular.
For example I'm interested in:
Can it manage thousands of connections? Maybe some external tool can help (like PgBouncer for Postgres).
Can it manage tens of thousands of concurrent read-writes?
What about disk space fragmentation? Can it be optimized without stopping the database?
What about smart replication? Can it tell that some data is missing from a replica when the master fails? Can I safely promote a slave to master, know exactly what data is missing, and act appropriately?
Can it fail gracefully? (like postgres for ex.)
Good reviews from production use.
Start with the premise that hard crashes are exceedingly rare, and that when they occur it won't be a tragedy if some information is lost.
Use of the database shouldn't be strongly coupled to the routine management of the game. Routine events ought to be managed through more ephemeral storage, and some secondary process should organize ephemeral events for eventual storage in the database. At the extreme, you could imagine there being just one database read and one database write per character per session.
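A minimal sketch of that idea, assuming Redis as the ephemeral store and a hypothetical save_to_sql persistence function:

    import json
    import redis

    r = redis.StrictRedis()

    def record_event(character_id, event):
        # Routine gameplay events hit fast, ephemeral storage only.
        r.rpush('events:%s' % character_id, json.dumps(event))

    def flush_session(character_id, save_to_sql):
        # Secondary process: one database write per character per session.
        key = 'events:%s' % character_id
        events = [json.loads(e) for e in r.lrange(key, 0, -1)]
        save_to_sql(character_id, events)  # hypothetical persistence call
        r.delete(key)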
Have you considered NoSQL?
NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key–value stores). The reduced run-time flexibility compared to full SQL systems is compensated by marked gains in scalability and performance for certain data models.
In short, NoSQL database management systems are useful when working with a huge quantity of data when the data's nature does not require a relational model. The data can be structured, but NoSQL is used when what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements. Usage examples might be to store millions of key–value pairs in one or a few associative arrays, or to store millions of data records. This organization is particularly useful for statistical or real-time analyses of growing lists of elements (such as Twitter posts or the Internet server logs from a large group of users).
There are higher-level NoSQL solutions, for example CouchDB, which has built-in replication support.
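For example, CouchDB exposes replication through its HTTP API; a minimal one-shot sketch, with a hypothetical replica host:

    import requests

    # Ask the local CouchDB to replicate one database to another node.
    requests.post('http://localhost:5984/_replicate', json={
        'source': 'game',
        'target': 'http://replica.example.com:5984/game',
    })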
Recently I encountered the concept of NoSQL, and as far as I can tell it is good for dealing with huge amounts of data.
My question is: at what scale does using NoSQL become worthwhile? Is it only for companies that handle really huge amounts of data, like Google or Facebook, or is it worth the trouble of switching to it from an SQL database even for smaller amounts of data?
I wonder what "concept of NoSQL" you mean, because it is an umbrella term for a wide field of different database technologies. The only thing they have in common is right there in the name, which sets them apart from traditional relational databases: they are "not (only) SQL". They have widely different philosophies, use cases, and target groups.
Just to give you an overview, here are a few of the large factions of NoSQL databases.
There are document-based databases like MongoDB or CouchDB. Their advantage is that they do not require a consistent data structure. They are useful when your requirements, and thus your database layout, change constantly, or when you are dealing with datasets that belong together but still look very different. When you have a lot of tables with two columns called "key" and "value", these might be worth looking into.
There are graph databases like Neo4j or GiraffeDB. Their focus is on defining data by its relations to other data. When you have a lot of tables whose primary keys are the primary keys of two other tables (and maybe some data describing the relation between them), these might be something for you.
Then you have simple key-value stores like MemcacheDB, Cassandra or Google's BigTable. They are very simplistic, but that makes them fast and easy to use. When you have no need for stored procedures, constraints, triggers and all those advanced database features and you just want fast storage and retrieval of your data, then those are for you.
And these are just a few facets of the new database world.
But there is still one sector where relational databases excel, and that's when it comes to following the ACID principle. Most NoSQL databases don't fully guarantee all four of these:
Atomic transactions (chains of commands which are processed together, in order, and all-or-none)
Consistent database schema with constraints and triggers which ensure that garbage data cannot exist in the database.
Isolation of transactions - transactions which are guaranteed to be unaffected by others which happen at the same time.
Durability - safety from data-loss even in case of a sudden system crash*
(* To be fair, most of the databases listed above are indeed pretty durable, especially those which are easy to set up as redundant fail-over clusters.)
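To see atomicity concretely, here is a small sketch using Python's built-in sqlite3, where the connection's context manager commits or rolls back a transaction as a unit; the account data is made up:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)')
    conn.executemany('INSERT INTO accounts VALUES (?, ?)',
                     [('alice', 100), ('bob', 0)])
    conn.commit()

    try:
        with conn:  # all-or-none: the block commits or rolls back together
            conn.execute("UPDATE accounts SET balance = balance - 50 "
                         "WHERE name = 'alice'")
            raise RuntimeError('crash mid-transfer')
    except RuntimeError:
        pass

    # Alice still has 100: the half-finished transfer was rolled back.
    print(conn.execute('SELECT * FROM accounts ORDER BY name').fetchall())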
I would like to remove the SQL dependency for the small chunks of data that I load on (almost) every request in a web application. Most of the data is key-value/document structured, but a relational solution is not excluded. The data is not too big, so I want to keep it in memory for higher availability.
What solution would you recommend?
The simplest and most widely used in-memory key-value store is Memcached. Its introduction page reiterates what you are asking for:
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
The client list is impressive. It has been around for a long time, and there is good documentation. It has an API for almost every programming language, and horizontal scaling is pretty simple. In my experience, Memcached is good.
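A minimal sketch with the python-memcached client, assuming a local memcached on the default port; the key and value are made up:

    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])
    mc.set('user:42:profile', {'name': 'Alice'}, time=300)  # expire in 5 min
    profile = mc.get('user:42:profile')  # None once expired or evicted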
You may also want to look into MemBase.
Redis is perfect for this kind of data. It also supports some fundamental data structures and provides operations on them.
I recently converted my Django forum app to use it for all real-time/tracking data - it's so good to no longer have the icky feeling you get when you do this kind of stuff (SET views = views + 1 and other writes on every page view) with a relational database.
Here's an example of using Redis to store data required for user activity tracking, including keeping an ordered set of last seen users up to date, in Python (the imports and key-name constants shown are illustrative, since the original snippet omitted them):
import datetime
import time

import redis as redis_module
from django.utils.html import escape

redis = redis_module.StrictRedis()

# Key names are illustrative; the original constants were not shown.
ACTIVE_USERS = 'active-users'
USER_USERNAME = 'user:%s:username'
USER_LAST_SEEN = 'user:%s:last-seen'
USER_DOING = 'user:%s:doing'

def seen_user(user, doing, item=None):
    """
    Stores what a User was doing when they were last seen and updates
    their last seen time in the active users sorted set.
    """
    last_seen = int(time.mktime(datetime.datetime.now().timetuple()))
    redis.zadd(ACTIVE_USERS, {user.pk: last_seen})
    redis.setnx(USER_USERNAME % user.pk, user.username)
    redis.set(USER_LAST_SEEN % user.pk, last_seen)
    if item:
        doing = '%s %s (%s)' % (
            doing, item.get_absolute_url(), escape(str(item)))
    redis.set(USER_DOING % user.pk, doing)
If you don't mind the SQL but want to keep the DB in memory, you might want to check out SQLite (see http://www.sqlite.org/inmemorydb.html).
If you don't want the sql and you really only have key-value pairs, why not just store them in a map / hash / associative array and be done with it?
If you end up needing an in-memory database, H2 is a very good option.
One more database to consider: Berkeley DB. Berkeley DB allows you to configure the database to be in-memory, on-disk or both. It supports both a key-value (NoSQL) and a SQL API. Berkeley DB is often used in combination with web applications because it's embedded, easily deployed (it deploys with your application), highly configurable and very reliable. There are several e-Retail web sites that rely on Berkeley DB for their e-Commerce applications, including Amazon.com.
I'm not sure this is what you are looking for, but you should look into a caching framework (something that may be included in the tools you are using now). With the repository pattern, you ask for the data and check whether you have it in the cache by key; if you don't, you fetch it from the database, and if you do, you fetch it from the cache.
It will depend on what kind of data you are handling, so it's up to you to decide how long to keep data in the cache. Perhaps a sliding timeout is best, as you'll keep the data for as long as the key keeps being requested; that way, if the cache has data for a user, once the user goes away the data will expire from the cache.
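A minimal sketch of that repository-with-sliding-timeout idea in plain Python; load_from_db stands in for whatever database call you use today:

    import time

    class CachedRepository:
        def __init__(self, load_from_db, ttl_seconds=300):
            self._load = load_from_db  # fallback loader, e.g. a SQL query
            self._ttl = ttl_seconds
            self._cache = {}           # key -> (value, expires_at)

        def get(self, key):
            entry = self._cache.get(key)
            now = time.time()
            if entry is not None and entry[1] > now:
                value = entry[0]
            else:
                value = self._load(key)  # cache miss: hit the database
            # Sliding timeout: every read pushes the expiry forward.
            self._cache[key] = (value, now + self._ttl)
            return value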
Can you shard this data? Is the data access pattern simple and stable (i.e. it does not change with changing business requirements)? How critical is the data (session context, for example, is not too hard to restore, whereas some preferences a user has entered on a settings page should not be lost)?
Typically, provided you can shard and your data access patterns are simple and do not mutate too much, you would choose Redis. If you are looking for something more reliable that supports more advanced data access patterns, Tarantool is a good option.
Please do check out this: http://www.mongodb.org/
It's a really good NoSQL database with drivers and support for all major languages.