Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Improve this question
I'm curious if anyone has implemented or even knows of any bitemporal databases built on NoSQL platforms (e.g., riak).
I don't know of any NoSQL datastore that are specifically designed to handle temporal data.
In order to put the valid and transaction time periods onto data in Riak you would need to either:
Wrap your documents/values with a structure that can hold metadata like:
{
meta:{
valid:["2001-11-08", "2001-11-09"],
transaction:["2011-01-29 10:27:00", "2011-01-29 10:28:00"]
}
payload:"This is the actual document/value I want to store!"
}
Create a "meta-document" for each document and use Riak Links to link them up.
I think this is a little bit cleaner but if you need to retrieve these times often then this method may be too slow.
If you want to retrieve documents by time then I don't think Riak (or any other key/value datastores that I know of) will be the right datastore to use. SQL or possibly some BigTable system may be your only good option.
I have written a small bitemporal, open source database layer based on Mongodb:
https://github.com/1123/bitemporaldb
When storing Scala or Java objects, the object is wrapped into a generic bitemporal object with bitemporal meta-information (valid time, transaction time). Subsequently it is serialized to json and stored as BSON in MongoDB.
It handles temporal and non-temporal updates to objects transparently. Search by bitemporal context is possible.
Document-oriented databases for bitemporal data are beneficial, since document oriented storage reduces the number of joins for data retrieval. Joins in a bitemporal context can be inefficient and hard to code by hand.
Feedback, contribution and feature-requests are very welcome.
To support a bitemporal (or temporal db model), you need acid transactions to perform the proper DML to update and insert records on two time dimensions (valid/effective time and transaction/system time). See for details on temporal modeling.
The popular NoSQL database like Cassandra, MongoDB, Couchbase, for example, don't have ACID support to perform the necessary record update/insert operations needed to support bitemporal record manipulation. With temporal and bitemporal databases records must never overlap and records must properly be terminated when superseded by succeeding valid/transaction time records.
MarkLogic NoSQL database claims support for bitemporal, but never tried it and is not open source. But you can roll your own solution by using ACID database that effectively functions as a valid/transaction time tracking journal and then use NoSQL for the actual data store. See high-level description of this approach here.
From Wikipedia:
"Bitemporal data is a concept used in a temporal database. It denotes both the valid time and transaction time of the data.
In a database table bitemporal data is often represented by four extra table-columns StartVT and EndVT, StartTT and EndTT. Each time interval is closed at its lower bound, and open at its upper bound."
So you can't just put these four values onto your data?
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
It's kinda a noob question but what is the difference between Relational Database Management System and database engine?
Thanks.
The original idea of an RDBMS differs from what is called an RDBMS these days. SQL DBMSs are commonly called RDBMSs, but it's more correct to say they can be used mostly relationally, if one has the knowledge and discipline. It's also possible to use them in the style of a network data model or even inconsistently, which seems like the more common attitude in the field.
The essence of the relational model is not about tables, but about first-order logic. Tables are simply a general-purpose data structure which can be used to represent relations. For example, a graph can be viewed in a relational way - as a set of ordered pairs - and can be represented as a table (with some rules to ensure the table is interpreted or manipulated correctly). By describing all data using domains, relations, dependencies and constraints, we can develop declarative consistency guarantees and allow any reasonable question to be answered correctly from the data.
A database engine is software that handles the data structure and physical storage and management of data. Different storage engines have different features and performance characteristics, so a single DBMS could use multiple engines. Ideally, they should not affect the logical view of data presented to users of the DBMS.
How easily you can migrate to another DBMS / engine depends on how much they differ. Unfortunately, every DBMS implements somewhat different subsets of the SQL standard, and different engines support different features. Trying to stick to the lowest common denominator tends to produce inefficient solutions. Object-relational mappers reintroduce the network data model and its associated problems which the relational model was meant to address. Other data access middleware generally don't provide a complete or effective data sublanguage.
Whatever approach you choose, changing it is going to be difficult. At least there's some degree of overlap between SQL implementations, and queries are shorter and more declarative than the equivalent imperative code, so I tend to stick to simple queries and result sets rather than using data access libraries or mappers.
A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model where in you can create many tables and have relations between them. While database engine is the underlying software component that a database management system (DBMS) uses to perform the operations from a database
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I need to develop a plan to move data from SQL server DB to any of the bigdata databases? Some of the questions that I have thought of are :
How big is the data?
What is the expected growth rate for this data?
What kind of queries will be run frequently? eg: look-up, range-scan, full-scan etc
How frequently the data moved from source to destination?
Can anyone help add to this questionnaire?
Firstly, How big is the data doesn't matter! This point barely can be used to decide on which NoSQL DB to use as most NoSQL DBs are made for easy scalability & storage. So all that matters is the query you fire rather than how much data is there. (Unless of course you intend to use it for storage & access of very small amounts of data because they would be a little expensive in many of the NoSQL DBs) Your first question must be Why consider NoSQL? Can't RDBMS handle it?
Expected growth-rate is a considerable parameter but then again not so valid, since most of the NOSQL DBs support storage of large amounts of data (without any scalability issues).
The most important one in your list is What kind of queries will be run?
This matters most since the RDBMS stores data as tuples and its easier to select tuples & output them with smaller amounts of data. Its faster at executing * queries(as its row-wise storage). But coming to NoSQL, most DBs are columnar or Column-oriented DBMS.
Row-oriented system : As data is inserted into the table, it is assigned an internal ID, the rowid that is used internally in the system to refer to data. In this case the records have sequential rowids independent of the user-assigned empid.
Column-oriented systems : A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.
Comparisons between row-oriented and column-oriented databases are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other bottlenecks in computers.
How frequently the data will be moved/accessed? is again a good question as accesses are costly and few of the NoSQL DBs are very slow the first time a query is shot(Eg: Hive).
Other parameters you may consider are :
Are update of rows(data in the table) required? (Hive has problems with updation, you usually have to delete and insert again)
Why are you using the database? (Search, derive relationships or analytics, etc) What type of operations would you want to perform on the data?
Will it require relationship searches? Like in case of Facebook Db(Presto)
Will it require aggregations?
Will it be used to relate various columns to derive insights?(like analytics to be done)
Last but a very important one, Do you want to store that data on HDFS(Hadoop distributed File System) as files or your DB's specific storage format or anything else? This is important since your processing depends on how your data is stored, whether it can be accessed directly or needs a query call which may be time consuming , etc.
couple more pointers
Type of no-sql DB that suits your requirement. i.e. key-value, document, column family and graph databases
CAP theorem to decide which is more critical amongst Consistency, Availability and Partition tolerance
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
We have a service that currently runs on top of a MySQL database and uses JBoss to run the application. The rate of the database growth is accelerating and I am looking to change the setup to improve scaling. The issue is not in a large number of rows nor (yet) a particularly high volume of queries but rather in the large number of BLOBs stored in the db. Particularly the time it takes to create or restore a backup (we use mysqldump and Percona Xtrabackup ) is a concern as well as the fact that we will need to scale horizontally to keep expanding the disk space in the future. At the moment the db size is around 500GB.
The kind of arrangement that I figure would work well for our future needs is a hybrid database that uses both MySQL and some key-value database. The latter would only store the BLOBs. The meta data as well as data for user management and business logic of the application would remain in the MySQL db and benefit from structured tables and full consistency. The application itself would handle the issue of consistency between the databases.
The question is which database to use? There are lots of NoSQL databases to choose from. Here are some points on what qualities I am looking for:
Distributed over multiple nodes, which are flexible to add or remove.
Redundancy of storage, with the database automatically making sure each value object is stored on at least two different nodes.
Value objects' size could range from a few dozen bytes to around 100MB.
The database is accessed from a java EJB application on top of JBoss as well as a program written in C++ that processes the data in the db. Some sort of connector for each would be needed.
No need for structure for the data. A single string or even just a large integer would suffice for the key, pure byte array for the value.
No updates for the value objects are needed, only inserts and deletes. If a particular object is made obsolete by a new object that fulfills the same role, the old object is deleted and a new object with a new key is inserted.
Having looked around a bit, Riak sounds good except for its problems with storing large value objects. Which database would you recommend?
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I am looking at CouchDB, which has a number of appealing features over relational databases including:
intuitive REST/HTTP interface
easy replication
data stored as documents, rather than normalised tables
I appreciate that this is not a mature product so should be adopted with caution, but am wondering whether it is actually a viable replacement for an RDBMS (in spite of the intro page saying otherwise - http://couchdb.apache.org/docs/intro.html).
Under what circumstances would CouchDB be a better choice of database than an RDBMS (e.g. MySQL), e.g. in terms of scalability, design + development time, reliability and maintenance.
Are there still cases where an RDBMS is still clearly the right choice?
Is this an either-or choice, or is a hybrid solution more likely to emerge as best practice?
I recently attended the NoSQL conference in London and think I have a better idea now how to answer the original question. I also wrote a blog post, and there are a couple of other good ones.
Key points:
We have accumulated probably 30 years knowledge of adminstering relational databases, so shouldn't replace them without careful consideration; non-relational data stores are less mature than relational ones, and so are inherently more risky to adopt
There are different types of non-relational data store; some are key-value stores, some are document stores, some are graph databases
You could use a hybrid approach, e.g. a combination of RDBMS and graph data store for a social software site
Document data stores (e.g. CouchDB and MongoDB) are probably the closest to relational databases and provide a JSON data structure with all the fields presented hierarchically which avoids having to do table joins, and (some might argue) is an improvement on the traditional object-relational mapping that most applications currently use
Non-relational databases support replication (including master-master); relational databases support replication too but it may not be as comprehensive as the non-relational option
Very large sites such as Twitter, Digg and Facebook use Cassandra, which is built from the ground up to support clustering
Relational databases are probably suitable for 90% of cases
In summary, consensus seems to be "proceed with caution".
Until someone gives a more in-depth answer, here are some pros and cons for CouchDB
Pros:
you don't need to fit your data into one of those pesky higher-order normal forms
you can change the "schema" of your data at any time
your data will be indexed exactly for your queries, so you will get results in constant time.
Cons:
you need to create views for each and every query, i.e. ad-hoc like queries (such as concatenating dynamic WHERE's and SORT's in an SQL) queries are not available.
you will either have redundant data, or you will end up implementing join and sort logic yourself on "client-side" (e.g. sorting a many-to-many relationship on multiple fields)
Pros or Cons:
creating your views are not as straightforward as in SQL, it's more like solving a puzzle. Depends on your type if this is a pro or a con :)
CouchDB is one of several available 'key/value stores', others include oldies like BDB, web-oriented ones like Persevere, MongoDB and CouchDB, new super-fast like memcached (RAM-only) and Tokyo Cabinet, and huge stores like Hadoop and Google's BigTable (MongoDB also claims to be on this space).
There's certainly space for both key/value stores and relational DBs. Traditionally, most RDBs are considered a layer above key/value. For example, MySQL used to use BDB as an optional backend for tables. In short, key/values know nothing about fields and relationships, which are the foundations of SQL.
Key/value stores typically are easier to scale, which makes them an attractive choice when growing explosively, like Twitter did. Of course, that means that any relationships between the stored values have to be managed on your code, instead of just declared in SQL. CouchDB's approach is to store big 'documents' in the value part, making them (mostly) self contained, so you can get most of the needed data in a single query. Many use cases fit on this idea, others don't.
The current theme I see is that after the "Rails doesn't scale!!" scare, now many people is realizing that it's not about your web framework; but about intelligent cacheing, to avoid hitting the database, and even the webapp when possible. The rising star there is memcached.
As always, it all depends on your needs.
This one is a hard question to answer. So I'll try to highlight the areas where CouchDB might work against you.
The two greatest sources of difficulty on the Couch Users and Dev mailing lists that people have are:
Complex Joins of Data.
Multi-Step Map/Reduce.
Couch Views are pretty much islands unto themselves. If you need to aggregate/merge/intersect a set of views you pretty much have to do so in the application layer for now. There are some tricks you can do with view collation and complex keys to help with joins but these only go so far for some types of data. This may or may not be livable for different applications. That being said many times this problem can reduced or eliminated by structuring your data differently.
The comments of the other folks on this question demonstrate some of the different types of data that are well suited to CouchDB.
One other thing to keep in mind is that a lot of times the data you might need to combine/merge/intersect would be data that you would do offline in an RDBMS database anyway so you might not lose anything by doing the same in CouchDB.
Short Answer: I think eventually CouchDB will be able to handle any kind of problem you want to throw at it. But the comfort level you have using it may differ from developer to developer. It's somewhat subjective I think. I happen to like using a turing complete language to query my data with and keeping more logic in the application layer. Your mileage may vary.
Sam you have to take another approch with CouchDB and in general with map or document based database. You can't define a constraint, such a unique, but you can query data to check if that email is used and if that login is used too. That's the right approch, you have to change your mind.
Correct me if I am wrong. Couchdb is useless for the cases when you need to validate uniqueness of docs over multiple fields. For example it's impossible to enforce validation rule like "both login and email required to be unique" and keep data in consustent state. You can check that before saving the doc, but someone can push before you and data becomes inconsistent.
If you are working with tabular data where there is only a shallow data hierarchy, than an RDBMS system is probably your best choice. This is the main use for RDBMS systems, and the documentation and tool support is very good.
For more nested data like xml, a document database should provide faster access to your data. Also, the storage model more closely resembles that of the data, so retrieval should be more straight forward.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
Improve this question
When would one choose a key-value data store over a relational DB? What considerations go into deciding one or the other? When is mix of both the best route? Please provide examples if you can.
Key-value, heirarchical, map-reduce, or graph database systems are much closer to implementation strategies, they are heavily tied to the physical representation. The primary reason to choose one of these is if there is a compelling performance argument and it fits your data processing strategy very closely. Beware, ad-hoc queries are usually not practical for these systems, and you're better off deciding on your queries ahead of time.
Relational database systems try to separate the logical, business-oriented model from the underlying physical representation and processing strategies. This separation is imperfect, but still quite good. Relational systems are great for handling facts and extracting reliable information from collections of facts. Relational systems are also great at ad-hoc queries, which the other systems are notoriously bad at. That's a great fit in the business world and many other places. That's why relational systems are so prevalent.
If it's a business application, a relational system is almost always the answer. For other systems, it's probably the answer. If you have more of a data processing problem, like some pipeline of things that need to happen and you have massive amounts of data, and you know all of your queries up front, another system may be right for you.
If your data is simply a list of things and you can derive a unique identifier for each item, then a KVS is a good match. They are close implementations of the simple data structures we learned in freshman computer science and do not allow for complex relationships.
A simple test: can you represent your data and all of its relationships as a linked list or hash table? If yes, a KVS may work. If no, you need an RDB.
You still need to find a KVS that will work in your environment. Support for KVSes, even the major ones, is nowhere near what it is for, say, PostgreSQL and MySQL/MariaDB.
IMO, Key value pair (e.g. NoSQL databases) works best when the underlying data is unstructured, unpredictable, or changing often. If you don't have structured data, a relational database is going to be more trouble than its worth because you will need to make lots of schema changes and/or jump through hoops to conform your data to the structure.
KVP / JSON / NoSql is great because changes to the data structure do not require completely refactoring the data model. Adding a field to your data object is simply a matter of adding it to the data. The other side of the coin is there are fewer constraints and validation checks in a KVP / Nosql database than a relational database so your data might get messy.
There are performance and space saving benefits for relational data models. Normalized relational data can make understanding and validating the data easier because there are table key relationships and constraints to help you out.
One of the worst patterns i've seen is trying to have it both ways. Trying to put a key-value pair into a relational database is often a recipe for disaster. I would recommend using the technology that suits your data foremost.
If you want O(1) lookups of values based on keys, then you want a KV store. Meaning, if you have data of the form k1={foo}, k2={bar}, etc, even when the values are larger/ nested structures, and want fast lookups, you want a KV store.
Even with proper indexing, you cannot achieve O(1) lookups in a relational DB for arbitrary keys. Sometimes this is referred to as "random lookups".
Alliteratively stated, if you only ever query by one column, a "primary key" if you will, to retrieve the rest of the data, then using that column as a keyspace and the rest of the data as a value in a KV store is the most efficient way to do lookups.
In contrast, if you often query the data by any of several columns, aka you support a richer query API for the data, then you may want a relational database.
A traditional relational database has problems scaling beyond a point. Where that point is depends a bit on what you are trying to do.
All (most?) of the suppliers of cloud computing are providing key-value data stores.
However, if you have a reasonably sized application with a complicated data structure, then the support that you get from using a relational database can reduce your development costs.
In my experience, if you're even asking the question whether to use traditional vs esoteric practices, then go traditional. While esoteric practices are sexy, challenging, and fun, 99.999% of applications call for a traditional approach.
With regards to relational vs KV, the question you should be asking is:
Why would I not want to use a relational model for this scenario: ...
Since you have not described the scenario, it's impossible for anyone to tell you why you shouldn't use it. The "catch all" reason for KV is scalability, which isn't a problem now. Do you know the rules of optimization?
Don't do it.
(for experts only) Don't do it now.
KV is a highly optimized solution to scalability that will most likely be completely unecessary for your application.