What sort of Database system should I use? - database

I'm planning to write an address book that stores contact information.
Each contact could have an unlimited number of fields.
Mostly strings and integers.
But perhaps references to other Objects.
What are the advantages and disadvantages of using an RDBMS with ORM vs OODBMS vs Document DBMS (like CouchDB)?
Thanks.

Most of the problems with relational databases arise when you have a large number of tables joined to one or many other tables: if you need to pull that data in one go, you will have to optimise your SQL query to make the joins efficient.
In NoSQL databases, the main objective is to be fast and to scale horizontally. Some avoid joins altogether, so you would have to do them yourself (by pulling the data into memory and matching it there). Facebook's own Cassandra (now an Apache project) is basically a NoSQL database system designed to have no single point of failure.
Also, RDBMS indexing is relatively faster (though that is debatable) than NoSQL indexing when it comes to large documents.
I haven't played with CouchDB or MongoDB, so I can't compare them. All I know is that with some (like Redis) joins happen in memory, which in effect means pulling all the data from the database into memory (RAM) and joining it there.
I don't know if that's what you're looking for.

Consider writing the data to a custom text file.
People's address books rarely get past a few hundred entries, so it's easy enough to scan through the entire list for whatever action you need to do.
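As a rough illustration of the text-file idea, here is a minimal sketch. It assumes one JSON object per line as the file format, which is my choice and not something the answer specifies; the function names are hypothetical.

```python
import json


def save_contacts(path, contacts):
    """Write one JSON object per line; each contact may have any fields."""
    with open(path, "w", encoding="utf-8") as fh:
        for contact in contacts:
            fh.write(json.dumps(contact) + "\n")


def find_contacts(path, field, value):
    """Scan the entire file and return the contacts whose field equals value."""
    matches = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            contact = json.loads(line)
            if contact.get(field) == value:
                matches.append(contact)
    return matches
```

With a few hundred entries, a full scan like find_contacts(path, "name", "Ada") should be fast enough that no index is needed.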

Related

Best database and database design for this particular need?

I'm looking to store around 50-100 million documents in a database and be able to do queries at a very fast speed. A document would look something like this:
{
name: 'example',
value: '300,201,512'
}
The value column is always unique, name is not.
Now I want to be able to only check if there exists a document with a specific value using a query. What database would be the best choice and which design would be best to approach the fastest speed for a query like that?
NoSQL databases try to offer certain functionality that more traditional relational database management systems do not. Whether it is holding simple key-value pairs for short periods of time for caching purposes, or keeping unstructured collections of data that could not easily be dealt with using relational databases and the structured query language (SQL), they are here to help.
In order to better understand the roles and underlying technology of each database management system, let's quickly go over these four operational models.
Key / Value Based
We will begin our NoSQL modeling journey with key / value based database management simply because they can be considered the most basic and backbone implementation of NoSQL.
These types of databases work by matching keys with values, similar to a dictionary. There is no structure nor relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42) which can later be retrieved the same way by supplying the key.
Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic ones after performing, for example, a CPU and memory intensive computation. They are extremely performant, efficient and usually easily scalable.
Note: When it comes to computers, a dictionary usually refers to a special sort of data object. It consists of a collection of individual keys, each matching a value.
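The key / value model described above can be sketched with a plain Python dictionary. This is only an in-process stand-in, not how Redis or any real server is implemented; KeyValueStore and cached_square are hypothetical names used for illustration.

```python
class KeyValueStore:
    """A toy key / value store: no structure, no relations, just keys."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it was present."""
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.set("the_answer_to_life", 42)


def cached_square(store, n):
    """Keep the result of a (pretend) expensive computation under a key."""
    key = f"square:{n}"
    value = store.get(key)
    if value is None:
        value = n * n  # stand-in for a CPU and memory intensive computation
        store.set(key, value)
    return value
```

The second function shows the caching use mentioned above: the result is computed once and afterwards fetched by key.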
Column Based
Column based NoSQL database management systems work by advancing the simple nature of key / value based ones.
Despite their complicated-to-understand image on the internet, these databases work very simply by creating collections of one or more key / value pairs that match a record.
Unlike the traditional defined schemas of relational databases, column-based NoSQL solutions do not require a pre-structured table to work with the data. Each record comes with one or more columns containing the information and each column of each record can be different.
Basically, column-based NoSQL databases are two dimensional arrays whereby each key (i.e. row / record) has one or more key / value pairs attached to it and these management systems allow very large and un-structured data to be kept and used (e.g. a record with tons of information).
These databases are commonly used when simple key / value pairs are not enough, and storing very large numbers of records with very large amounts of information is a must. DBMSs implementing column-based, schema-less models can scale extremely well.
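The "two dimensional array" picture above can be sketched as a dictionary of dictionaries. This is an in-memory illustration under that assumption only; real column stores add persistence, ordering and distribution, and the names table, put and get_row are hypothetical.

```python
from collections import defaultdict

# row key -> {column name -> value}; rows do not have to share columns
table = defaultdict(dict)


def put(row_key, column, value):
    """Attach one key / value pair (a column) to a record."""
    table[row_key][column] = value


def get_row(row_key):
    """Return a copy of all columns stored for a record (empty if unknown)."""
    return dict(table.get(row_key, {}))


put("row1", "name", "Ada")
put("row1", "phone", 123)
put("row2", "email", "bob@example.com")  # a different set of columns
```

Note that no schema was declared anywhere: each record simply carries whatever columns were written to it.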
Document Based
Document based NoSQL database management systems can be considered the latest craze that managed to take a lot of people by storm. These DBMS work in a similar fashion to column-based ones; however, they allow much deeper nesting and complex structures to be achieved (e.g. a document, within a document, within a document).
Documents overcome the constraints of the one or two levels of key / value nesting of columnar databases. Basically, any complex and arbitrary structure can form a document, which can be stored using these management systems.
Despite their powerful nature, and the ability to query records by individual keys, document based management systems have their own issues and downfalls compared to others. For example, retrieving a value of a record means getting the whole lot of it and same goes for updates, all of which affect the performance.
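To make the nesting and the whole-document cost described above concrete, here is a small sketch. It assumes documents are stored as serialized JSON text, which mirrors the description above but is not a statement about how any particular document database stores data; all names are hypothetical.

```python
import json

documents = {}  # document id -> serialized JSON text


def save(doc_id, document):
    documents[doc_id] = json.dumps(document)


def get_path(doc_id, *path):
    """Read one nested value; the whole document is decoded to get it."""
    node = json.loads(documents[doc_id])
    for key in path:
        node = node[key]
    return node


def update_path(doc_id, value, *path):
    """Change one nested value; the whole document is decoded and rewritten."""
    document = json.loads(documents[doc_id])
    node = document
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    documents[doc_id] = json.dumps(document)


save("c1", {"name": "Ada", "address": {"city": "London", "geo": {"lat": 51.5}}})
```

Both get_path and update_path touch the full document even when only one field is of interest, which is the performance concern raised above.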
Graph Based
Finally, the very interesting flavour of NoSQL database management systems is the graph based ones.
The graph based DBMS models represent the data in a completely different way than the previous three models. They use graph structures, with nodes and edges connecting each other through relations.
Similarly to mathematics, certain operations are much simpler to perform using these types of models thanks to their nature of linking and grouping related pieces of information (e.g. connected people).
These databases are commonly used by applications whereby clear boundaries for connections are necessary to establish. For example, when you register to a social network of any sort, your friends' connection to you and their friends' friends' relation to you are much easier to work with using graph-based database management systems.
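The friends-of-friends case above can be sketched with an adjacency map. This is a toy in-memory graph with made-up people, not the API of any graph database.

```python
# node -> set of directly connected nodes (edges are friendships)
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}


def friends_of_friends(graph, person):
    """People two edges away from person, excluding direct friends."""
    direct = graph.get(person, set())
    reachable = set()
    for friend in direct:
        reachable |= graph.get(friend, set())
    return reachable - direct - {person}
```

The traversal follows edges directly; in a relational schema the same question would typically need a self-join on a friendship table.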
Fastest document based DBs:
1) MongoDB
2) DynamoDB
Here is the difference for your reference
I would give preference to DynamoDB.
Currently we are working on an AWS data lake with really fast performance:
we store data in S3 and get it back via Athena.
If you want to import the data into a database, then try using MS SQL Server 2008 R2, because it is very user friendly and allows you to do your work more accurately and precisely. If you want to do that without any cost, then MySQL will be a better option (a better MySQL editor is SQLYog). I hope this is helpful for you.
Short Answer:
I think 100 million documents with the structure and conditions you mention is not BIG ENOUGH to need NoSQL. You can handle them with PostgreSQL, MySQL, etc.
Note that Wikipedia used MySQL for a long time (though not now). See Reference.
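As a sketch of the relational approach suggested here, the existence check can be served by a unique index on the value column. The example uses SQLite from the Python standard library purely as a stand-in for PostgreSQL or MySQL; the table name documents and the helper value_exists are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE makes the database build an index on value, so lookups by value
# do not have to scan the table.
conn.execute(
    "CREATE TABLE documents (name TEXT NOT NULL, value TEXT NOT NULL UNIQUE)"
)
conn.executemany(
    "INSERT INTO documents (name, value) VALUES (?, ?)",
    [("example", "300,201,512"), ("example", "17,4,99")],
)


def value_exists(connection, value):
    """Return True if a document with this exact value is stored."""
    row = connection.execute(
        "SELECT EXISTS(SELECT 1 FROM documents WHERE value = ?)", (value,)
    ).fetchone()
    return bool(row[0])
```

Since name is not unique, two rows may share it, while inserting a duplicate value would be rejected by the constraint.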

Double index within a noSQL database

I am working on creating a database to store three things. Let's say Experiment, Measure, metadata. The metadata is composed of a set of variable number and type of attributes, thus making the choice of a NoSQL attractive.
I need two simple queries over the database:
1) Give me the metadata of all the experiments with a given value of Measure.
2) Give me the metadata of all the measures for an Experiment.
And my main requirements are:
1) Tons of data. Each Experiment can come with millions of possible measures (and of course the metadata), and I expect tens of thousands of Experiments.
2) Concurrency. I would like to have fast concurrent read/write because at any given point in time I may be running 10-20 experiments, and they will want to write millions of measures at the same time.
I've tried MongoDB, but it is slow due to the write locks. I would like to have something faster. Additionally, it does not handle one of my queries well, as I basically need two indexes here. I am considering Titan as an alternative, just because it seems natural to think of experiments and measures as nodes, and connect them with edges. Hypertable seems another possibility if I can find a way of doing both queries fast.
There are so many noSQL databases out there that I may be missing the right one for my needs. Suggestions?
Have you looked into NewSQL databases that could fit your needs? I suggest that you take a closer look at Starcounter that is true ACID, no locks on the writes and supports indexing on basic properties as well as combined indexes.
I think a transactional database that is object oriented and memory centric would suit your demands. You can then have different Experiments and Measures that derive from the same class, and you can query each type as well as the inherited types separately.
If you do not have more than a TB of data you do not need the kind of big data database that you have looked into so far. They are really good at what they do, but I think you should look at the other end of the spectrum of NoSQL databases. When using an in-memory (all writes secured on persistent storage media of course) database that is object oriented you get about 4 times the compression compared to relational databases, so the TB of data would often be enough.
It is really hard to find your way around in the jungle of databases today, so I understand the difficulty of finding something that fits your requirements. In your case - my 5 cents on a transactional NoSQL database that is true ACID and with SQL query support!
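Whatever database is chosen, the two queries in the question amount to looking up the same records by two different keys. A minimal in-memory sketch of that double index follows; it is not Starcounter's API or any product's, and the names by_experiment, by_measure and record are hypothetical.

```python
from collections import defaultdict

by_experiment = defaultdict(list)  # experiment id -> metadata records
by_measure = defaultdict(list)     # measure value -> metadata records


def record(experiment, measure, metadata):
    """Store one measurement and register it under both indexes."""
    entry = {"experiment": experiment, "measure": measure, **metadata}
    by_experiment[experiment].append(entry)
    by_measure[measure].append(entry)
    return entry


record("exp1", 0.5, {"unit": "mV"})
record("exp1", 0.7, {"unit": "mV", "operator": "kim"})
record("exp2", 0.5, {"temperature": 21})

# Query 1: metadata of all experiments with a given value of Measure.
with_measure = by_measure.get(0.5, [])
# Query 2: metadata of all the measures for an Experiment.
for_experiment = by_experiment.get("exp1", [])
```

Each write updates two structures, which is the usual price of a second index: reads by either key are direct, while writes do a little more work.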

What are the best ways to mitigate database I/O bottlenecks for large web sites?

For large web sites (traffic-wise) that have a lot of incoming reads and updates that end up as database I/O, what are the best ways to mitigate the performance impact? One solution I can think of: for writes, cache and then do a delayed write (using a separate job); for reads, use the memcached concept. Are there any other, better solutions?
Here are the most common solutions to database performance:
Caching (Memcache, etc)
Add memory to your database
More database servers (master/slave or sharding)
Use a different database type (NoSQL, Redis, etc)
Indexes to speed up read perf. (careful, too many will affect write performance)
SSDs (fast SSDs will help a lot)
RAID
Optimize/tune SQL queries
Don't forget to optimize your queries. Most of the times it is not the disk I/O, but poorly written queries which turn out to be the bottleneck.
You can also cache query results and also entire web pages if the content isn't going to change too often.
It very much depends on the usage pattern and data type. There are really different things to do depending on whether transactions are going to be supported, whether you are interested in full consistency or "eventual consistency", how big the data is (will it all fit in a huge memory?), how complex the data and queries are; the list goes on and on. There are lots of variables, and only after listing all the constraints/requirements will you be able to make a proper decision. Two general pieces of advice, though:
Use SSDs
Use distributed architecture with distributed "NoSQL" (key/value) approach (only if you do not have to use complex relations and transactions)
10 years ago, the standard answer - besides optimizing your particular database - was scale-out using MySQL in two ways.
Reads can be scaled out in two ways. The first is through caching, which introduces possible inconsistencies and creates a separate cache layer. Reads can also be scaled in MySQL by creating "read replicas", where any database can be queried. Any write must be applied to all servers, so replication doesn't help write throughput.
Writes are scaled through sharding. For example, imagine all users whose last name starts with 'a' are assigned to a certain server. Now imagine a more complicated shard algorithm, where a particular row's primary ID is hashed using a hash function, and distributed to one of a pool of servers.
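The hashed variant described above can be sketched in a few lines. The server names are hypothetical and MD5 is used only as a stable, well-distributed hash (not for security); real shard layers differ in detail.

```python
import hashlib

SERVERS = ["db0", "db1", "db2", "db3"]  # hypothetical pool of database servers


def shard_for(primary_id, servers=SERVERS):
    """Map a row's primary ID to one server of the pool via a hash."""
    digest = hashlib.md5(str(primary_id).encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

The same ID always maps to the same server, so both the write and any later read for that row know where to go without a lookup table.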
Facebook is one of the most advanced proponents of a sharded MySQL architecture. You can have individual tables "joined" but you have to write custom code, because you might have to hop from server to server - imagine you want to get your friend's timeline posts, you can't simply join it, you have to write some application code.
Once you shard your database, you can't do joins, and range lookups become difficult. What remains is a subset sometimes called CRUD operations, for which MySQL is overkill. Many Chinese social networks realized this and use sharded Redis (which is much quicker than MySQL), having written their own shard layer and application logic layers.
Imagine the next problem in sharding - you want to add a new server, and start assigning some users to that new server.
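To see why adding a server is a problem, the sketch below counts how many keys change owner when a fifth server joins. It compares plain modulo hashing with consistent hashing on a ring; consistent hashing is one well-known technique for this situation, named here by me rather than by the answer, and all names in the code are hypothetical.

```python
import bisect
import hashlib


def _hash(text):
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def modulo_owner(key, servers):
    """Plain hash-modulo sharding."""
    return servers[_hash(key) % len(servers)]


class HashRing:
    """Consistent hashing: each server owns many points on a ring."""

    def __init__(self, servers, vnodes=100):
        self._points = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._hashes = [point for point, _ in self._points]

    def owner(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        index = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
        return self._points[index][1]


def moved(keys, before, after):
    """Number of keys whose owner differs between two assignments."""
    return sum(1 for key in keys if before(key) != after(key))


keys = [f"user{i}" for i in range(1000)]
old_servers = ["db0", "db1", "db2", "db3"]
new_servers = old_servers + ["db4"]

moved_modulo = moved(
    keys,
    lambda k: modulo_owner(k, old_servers),
    lambda k: modulo_owner(k, new_servers),
)
old_ring, new_ring = HashRing(old_servers), HashRing(new_servers)
moved_ring = moved(keys, old_ring.owner, new_ring.owner)
```

With modulo hashing most keys end up on a different server after the change, while on the ring only the keys taken over by the new server move, so far less data has to be migrated.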
Another approach is to use a distributed database, which generally goes under the names NoSQL or NewSQL and comes in a variety of approaches. Some, like MongoDB, have a sharding system to manage this mapping, but require manual steps to add servers. Cassandra has a more flexible clustering scheme, based on a Chord-like ring architecture. Systems like Couchbase and Aerospike use a random distribution mechanism that removes the need for a shard layer. Some of these databases can exceed 100,000 to 200,000 requests per second per server, with the lateral scale to add new servers, which is enough for very large operations. With this style of clustering, you can often get a higher level of redundancy and reliability.
Other distributed approaches represent data in a more efficient way, like a graph database. If you have a problem that is better represented as a graph, then a clustered graph database may be more appropriate.

Very large database, very small portion most being retrieved in real time

I have an interesting database problem. I have a DB that is 150GB in size. My memory buffer is 8GB.
Most of my data is rarely being retrieved, or mainly being retrieved by backend processes. I would very much prefer to keep them around because some features require them.
Some of it (namely some tables, and some identifiable parts of certain tables) is used very often in a user facing manner.
How can I make sure that the latter is always being kept in memory? (there is more than enough space for these)
More info:
We are on Ruby on Rails. The database is MySQL, and our tables are stored using InnoDB. We are sharding the data across 2 partitions. Because we are sharding it, we store most of our data as JSON blobs, while indexing only the primary keys.
Update 2
The tricky thing is that the data is actually being used for both backend processes as well as user facing features. But they are accessed far less often for the latter
Update 3
Some people are commenting that 8GB is a toy these days. I agree, but just increasing the size of the db is pure LAZINESS if there is a smarter, more efficient solution.
This is why we have Data Warehouses. Separate the two things into either (a) separate databases or (b) separate schema within one database.
Data that is current, for immediate access, being updated.
Data that is historical fact, for analysis, not being updated.
150GB is not very big, and a single database can handle your little bit of live data and your big bit of history.
Use a "periodic" ETL process to get things out of active database, denormalize into a star schema and load into the historical data warehouse.
If the number of columns used in the customer facing tables is small, you can make indexes containing all the columns used in the queries. This doesn't mean that all the data stays in memory, but it can make the queries much faster. It's trading space for response time.
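An index that contains every column a query needs is often called a covering index. The sketch below demonstrates the idea with SQLite from the Python standard library, only as a stand-in, since the question is about MySQL/InnoDB and its planner output looks different; the table and index names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT, email TEXT, profile_json TEXT)"
)
# The index holds every column the user facing query reads.
conn.execute("CREATE INDEX idx_users_name_email ON users (name, email)")


def query_plan(connection, sql, params=()):
    """Return SQLite's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column is the detail text


covered = query_plan(
    conn, "SELECT name, email FROM users WHERE name = ?", ("orion",)
)
not_covered = query_plan(
    conn, "SELECT name, profile_json FROM users WHERE name = ?", ("orion",)
)
```

For the first query SQLite reports a covering index, meaning the table rows (and the large JSON blob in them) are never read; the second query still has to visit the table to fetch profile_json.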
This calls for memcached! I'd recommend using cache-money, a great ActiveRecord write-through caching library. The ngmoco branch has support for enabling caching per-model, so you could only cache those things you knew you wanted to keep in memory.
You could also do the caching by hand using $cache.set/get/expire calls in controller actions or model hooks.
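The set/get/expire pattern mentioned above can be sketched as a small write-through cache. The example is in Python rather than Ruby and uses a plain dictionary in place of memcached; it does not reproduce cache-money's API, and WriteThroughCache with its load and store callbacks is a hypothetical helper.

```python
class WriteThroughCache:
    """Keep selected records in memory; writes go to the database first."""

    def __init__(self, load, store):
        self._load = load    # function(key) -> value, reads the database
        self._store = store  # function(key, value), writes the database
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)  # cache miss: hit the database
        return self._cache[key]

    def set(self, key, value):
        self._store(key, value)   # write through to the database
        self._cache[key] = value  # then keep the fresh copy in memory

    def expire(self, key):
        self._cache.pop(key, None)


database = {}  # stand-in for the real database
cache = WriteThroughCache(load=database.get, store=database.__setitem__)
cache.set("user:1", {"name": "Ada"})
```

Only the models routed through such a cache stay in memory, which matches the goal of keeping the user facing data hot while backend data goes straight to the database.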
With MySQL, proper use of the Query Cache will keep frequently queried data in memory. You can provide a hint to MySQL not to cache certain queries (e.g. from the backend processes) with the SQL_NO_CACHE keyword.
If the backend processes are accessing historical data, or accessing data for reporting purposes, certainly follow S. Lott's suggestion to create a separate data warehouse and query that instead. If a data warehouse is too much to accomplish in the short term, you can replicate your transactional database to a different server and perform queries there (a Data Warehouse gives you MUCH more flexibility and capability, so go down that path if possible)
UPDATE:
See documentation of SELECT and scroll down to SQL_NO_CACHE.
Read about the Query Cache
Ensure query_cache_type is set appropriately for your needs.
UPDATE 2:
I confirmed with MySQL support that there is no mechanism to selectively cache certain tables etc. in the InnoDB buffer pool.
So, what is the problem?
First, 150GB is not very large today. It was 10 years ago.
Second, any non-total-crap database system will utilize your memory as cache. If the cache is big enough (compared to the amount of data that is in use) it will be efficient. If not, the only thing you CAN do is get more memory (because, sorry, 8GB of memory is VERY low for a modern server; it was low 2 years ago).
You should not have to do anything for the memory to be used efficiently. At least not on a commercial level database. Maybe MySQL sucks, but I would not assume this.

Pros of databases like BigTable, SimpleDB

New school datastore paradigms like Google BigTable and Amazon SimpleDB are specifically designed for scalability, among other things. Basically, disallowing joins and denormalizing are the ways this is accomplished.
In this topic, however, the consensus seems to be that joins on large tables don't necessarily have to be too expensive and denormalization is "overrated" to some extent.
Why, then, do these aforementioned systems disallow joins and force everything together in a single table to achieve scalability? Is it the sheer volumes of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply to these scales?
Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?
Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc. are doing. (Oracle finally got in the game as well, with their recent announcement; Microsoft bought its solution in the form of the company formerly called DATAllegro.)
That being said, when the data scales up into terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees you can get from RDBMSs, it is often far easier to denormalize and not do joins. Especially if you don't need to cross-reference much. Especially if you are not doing ad-hoc analysis, but require programmatic access with arbitrary transformations.
Denormalization is overrated. Just because that's what happens when you are dealing with a 100 Tera, doesn't mean this fact should be used by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.
But if you are in the 100 Tera range, by all means...
Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.
On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
I'm not too familiar with them (I've only read the same blog/news/examples as everyone else) but my take on it is that they chose to sacrifice a lot of the normal relational DB features in the name of scalability. I'll try to explain.
Imagine you have 200 rows in your data-table.
In Google's datacenter, 50 of these rows are stored on server A, 50 on B, and 100 on server C. Additionally, server D contains redundant copies of the data from servers A and B, and server E contains redundant copies of the data on server C.
(In real life I have no idea how many servers would be used, but it's set up to deal with many millions of rows, so I imagine quite a few).
To "select * where name = 'orion'", the infrastructure can fire that query to all the servers, and aggregate the results that come back. This allows them to scale pretty much linearly across as many servers as they like (FYI this is pretty much what mapreduce is)
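The fan-out described above is often called scatter/gather. Here is a minimal sketch, using threads and in-memory lists to stand in for the servers; the shard contents are made up and nothing here reflects Google's actual infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each server holds a slice of the rows.
SHARDS = {
    "A": [{"id": 1, "name": "orion"}, {"id": 2, "name": "vega"}],
    "B": [{"id": 3, "name": "rigel"}],
    "C": [{"id": 4, "name": "orion"}, {"id": 5, "name": "altair"}],
}


def query_shard(rows, predicate):
    """What a single server does: filter only its own rows."""
    return [row for row in rows if predicate(row)]


def scatter_gather(shards, predicate):
    """Send the same query to every shard and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(
            lambda rows: query_shard(rows, predicate), shards.values()
        )
        return [row for partial in partials for row in partial]


matches = scatter_gather(SHARDS, lambda row: row["name"] == "orion")
```

Each shard works independently of the others, which is why adding servers scales this kind of query so well, and also why anything that needs rows from two shards at once (a join) does not fit the pattern.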
This however means you need some tradeoffs.
If you needed to do a relational join on some data that was spread across, say, 5 servers, each of those servers would need to pull data from each other for each row. Try doing that when you have 2 million rows spread across 10 servers.
This leads to tradeoff #1 - No joins.
Also, depending on network latency, server load, etc., some of your data may get saved instantly, but some may take a second or 2. Again, when you have dozens of servers, this gets longer and longer, and the normal approach of 'everyone just waits until the slowest guy has finished' is no longer acceptable.
This leads to tradeoff #2 - Your data may not always be immediately visible after it's written.
I'm not sure what other tradeoffs there are, but off the top of my head those are the main 2.
So what I'm getting is that the whole "denormalize, no joins" philosophy exists, not because joins themselves don't scale in large systems, but because they're practically impossible to implement in distributed databases.
This seems pretty reasonable when you're storing largely invariant data of a single type (Like Google does). Am I on the right track here?
If you are talking about data that is virtually read-only, the rules change. Denormalisation is hardest in situations where data changes because the work required is increased and there are more problems with locking. If the data barely changes then denormalisation is not so much of a problem.
Nowadays you need a more interoperable environment for databases. More and more often you need not only relational DBs like MySQL or MS SQL, but also Big Data farms such as Hadoop or non-relational DBs like MongoDB. In some cases all those DBs will be used in one solution, so their performance must be as equal as possible at the macro scale. This means that you will not be able to use, let's say, Azure SQL as the relational DB and one VM with 2 cores and 3GB of RAM for MongoDB. You must scale up your solution and use DB as a Service when possible (if it is not possible, then build your own cluster in a cloud).

Resources