The requirements for the project I'm working on seem to point to using a relational database (e.g. PostgreSQL, MySQL) in combination with a key-value store (e.g. HBase, Cassandra). Our data breaks fairly cleanly into the two data models, with the exception of a small amount of interdependence.
This is not an attempt to cram a relational database into a key-value store; they are independent of each other.
Are there any serious reasons to not do this?
It should work fine.
There are a couple of things you need to be aware of / watch out for:
Your program is now responsible for the data consistency between the stores, not the relational model.
Depending on your technology you may or may not have transactions that span the data stores. Here you might have to program some manual clean up work in the case of a failure.
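That manual clean-up work can be sketched as a compensating action: write to both stores, and on failure undo whichever half succeeded. A minimal sketch, assuming sqlite3 stands in for the relational database and a plain dict stands in for the key-value store (all names are illustrative):

```python
# Application-level consistency across two stores: the program, not the
# database, must compensate when one of the two writes fails.
import sqlite3

rdb = sqlite3.connect(":memory:")  # stand-in for the relational DB
rdb.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
kv = {}                            # stand-in for the key-value store

def place_order(order_id, payload):
    """Write to both stores; on failure, manually clean up both halves."""
    rdb.execute("INSERT INTO orders VALUES (?, 'pending')", (order_id,))
    try:
        kv[f"order:{order_id}"] = payload  # may fail in a real system
        rdb.execute("UPDATE orders SET status = 'ok' WHERE id = ?", (order_id,))
        rdb.commit()
    except Exception:
        rdb.rollback()                      # undo the relational half
        kv.pop(f"order:{order_id}", None)   # manual cleanup of the K/V half
        raise

place_order(1, {"item": "widget"})
```

There is no transaction spanning both stores here; the try/except block is the "manual clean up work" the answer refers to.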
I work in SQL DBMS territory, so take that bias into account, but...
Like Shiraz Bhaiji, I worry about the "except for a small amount of interdependence". There are a number of things to think about, and the answers will help you determine what to do.
What happens if something goes wrong with the interdependence? (Customers lose money - then you need to use a DBMS throughout; you lose money - probably the same; someone gets reported as having 3045 points when they really have 3046 - maybe it doesn't matter.)
How hard is it to fix up the 'mess' when something goes wrong?
How much of the work is on the key-value store and how much is on the DBMS?
Can the interdependence be removed by moving some stuff from key-value store to DBMS?
How slow is the DBMS when used as a key-value store? (Are you sure there's no way to bring it close enough to parity?)
What happens in disaster recovery scenarios? Synchronized backups?
If you have adequate answers to these and related questions, then it is OK to go with the mixed setup - you've thought it through, weighed the risks, formed a judgement, and it is reasonable to go ahead. If you don't have answers, get them.
When you say key-value store are you meaning like in a session or a cache type of implementation? There are always reasons to do such things...reading from and writing to a database is generally your most resource intensive operation. More details?
Related
Let's say we want to store data representing (eventually) all possible English words, including all their forms, synonyms, and parts of speech; tenses, common expressions, and idioms; and even more - a lot of connections that we do not even know about right now.
Requirements for searching through the words: it should be fast. I might instantly want to get all idioms in which the word "go" is used, or all three-letter words that are related to business (a tag) and have the same form in all tenses.
What kind of database would you use for that type of problem?
A NoSQL document database, like MongoDB? Probably not, since there are a lot of connections. Though at first glance it might not be a bad idea - JSON represents the picture clearly (for a human).
A relational database, like MySQL or whateverSQL? Maybe not - there might be lots of joins and many indexes, and even that might not be enough. And even with clusters, the structure might become a mess to understand and support.
A [Graph database][1]? This seems to be all about links and connections between objects, and it is much closer to the OO way of representing data (connections are easy to understand because they carry names and types - e.g. connection/association: synonym). But it might be slow compared to a relational DB (if Wikipedia is to be believed)? (I've never worked with this type of database.) And what about scaling (maybe these databases are not yet proven for real tasks)?
Create your own? (I would not go this way.)
The questions are:
Is there another type of database / representation of the data you could use for this type of task?
Does anyone have a strong, proven opinion based on experience working with similar problems?
I'd try going with a Graph database. You may find some inspiration in this talk: http://skillsmatter.com/podcast/home/case-study-using-graph-theory-graph-databases-to-understand-user-intent/mh-6603
And just to nit-pick a bit: graph databases are considered NoSQL as well. Check this great talk by Martin Fowler: http://www.youtube.com/watch?v=qI_g07C_Q5I
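To illustrate why a graph shape fits the word problem, here is a toy adjacency-list model - plain Python dicts, not a real graph database, and all words and edge types are made up:

```python
# Words and idioms as nodes, typed edges between them: "all idioms using
# 'go'" becomes a single edge lookup rather than a multi-table join.
edges = {
    ("go", "used_in"): ["go for broke", "go the extra mile"],
    ("go", "synonym"): ["proceed", "leave"],
    ("proceed", "tag"): ["business"],
}

def neighbours(node, rel):
    """Follow one typed edge from a node -- a constant-time hop."""
    return edges.get((node, rel), [])

idioms_with_go = neighbours("go", "used_in")
```

A real graph database adds indexing, persistence, and a query language on top, but the access pattern - hop along named relationships - is the same.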
I would like to know whether it is worth using graph databases specifically for working with relationships.
I intend to use a relational database for storing entities like "User", "Page", "Comment", "Post", etc.
But in most cases of a typical social-graph workload, I have to do deep traversals, which relational databases are not good at and which involve slow joins.
Example: Comment -(made_in)-> Post -(made_in)-> Page etc...
I'm thinking of doing something like this:
Example:
User id: 1
Query: Get all followers of user_id 1
Query Neo4j for all incoming edges named "follows" on the node for user id 1
With the resulting list of ids, query the Users table:
SELECT *
FROM users
WHERE user_id IN (ids)
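The two-step lookup above can be sketched end to end. This is a minimal simulation, with an in-memory adjacency dict standing in for Neo4j and sqlite3 standing in for MySQL (all names and data are illustrative):

```python
# Step 1 happens in the "graph", step 2 in the relational store.
import sqlite3

follows = {2: [1], 3: [1], 4: [2]}  # follower_id -> ids they follow

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")])

# Step 1: graph side -- ids of users who follow user 1.
follower_ids = [uid for uid, followed in follows.items() if 1 in followed]

# Step 2: relational side -- hydrate those ids from the users table.
placeholders = ",".join("?" * len(follower_ids))
rows = db.execute(f"SELECT name FROM users WHERE user_id IN ({placeholders})",
                  follower_ids).fetchall()
```

The `IN (ids)` query itself is fast for reasonable id lists; the cost the answers below warn about is the round trip between the two systems on every request.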
Is this slow?
I have seen the question Is it a good idea to use MySQL and Neo4j together?, but I still cannot understand why the accepted answer says it is not a good idea.
Thanks
Using Neo4j is a great choice of technology for an application like yours that requires deep traversals. The reason it's a good choice is twofold: first, the Cypher language makes such queries very easy; second, deep traversals happen very quickly because of the way the data is structured in the database.
In order to reap both of these benefits, you will want to have both the relationships and the people (as nodes) in the graph. Then you'll be able to do a friend-of-friends query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof
and a friend-of-friend-of-friend query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->()-[:friend]->fofof
RETURN john, fofof
...and so on. (Same idea for posts and comments, just replace the name.)
Using Neo4j alongside MySQL is fine, but I wouldn't do it in this particular way, because the code will be much more complex, and you'll lose too much time hopping between Neo4j and MySQL.
Best of luck!
Philip
In general, the more databases/systems/layers you have, the more complex the overall setup and operation will be.
Think about tasks like synchronization, export/import, backup/archive, etc., which become quite expensive as your database(s) grow in size.
People use polyglot persistence only if the benefits of having dedicated, specialized databases outweigh the drawbacks of having to cope with multiple data stores. For example, this can be the case if you have a large number of data items (activity or transaction logs, say), each related to a user. It would probably make no sense to store all the information in a graph database if you're only interested in the connections between the data items. You would be better off storing only the relations in the graph (with each node holding just a pointer into the other database), and the data per item in a K/V store or the like.
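The split described above - relationships in the graph, full records in a K/V store, nodes carrying only pointers - can be sketched with plain Python stand-ins (all names are made up):

```python
# The "graph" holds connections only; each node key doubles as the pointer
# into the K/V store, where the full per-item payload lives.
graph = {"u1": ["u2", "u3"], "u2": ["u3"]}          # connections only
kv = {                                               # full item payloads
    "u1": {"name": "Ann", "log": ["login", "post"]},
    "u2": {"name": "Ben", "log": ["login"]},
    "u3": {"name": "Cat", "log": []},
}

def connected_records(node_key):
    """Traverse the graph, then dereference each pointer into the K/V store."""
    return [kv[k] for k in graph.get(node_key, [])]

friends_of_u1 = connected_records("u1")
```

The traversal stays cheap because the graph never carries the bulky payloads; the K/V store is hit only for the handful of nodes the traversal actually returns.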
For your example use case, I would go with only one database, namely Neo4j, because it's a graph problem.
As the other answers indicate, using Neo4j as your single data store is preferable. However, in some cases there might not be much choice in the matter, because you already have another database behind your product. I would just like to add that if this is the case, running Neo4j as your secondary database does work (the product I work on operates in this mode). You do have to work extra hard at figuring out what functionality you expect out of Neo4j, what kind of data you need for it, how to keep the data in sync, and the consequences of not always having real-time results. Most of our use cases can work with near-real-time results, so we are fine, but that may not be the case for your product. Still, to me, using Neo4j in this mode is preferable to running without it.
We are able to produce a lot of graphy-great stuff as a result of it.
I have a huge database.
In this database I have a User table.
In this User table I have all the information I can get about a user - address, username, weight, hair color and much more (50-80 columns, I guess).
Now I want to add user settings.
Of course, one user can have only one settings record, so it's a 1:1 relationship, and by the rules of normalization I learned years ago, the settings should go as columns in the User table.
But logically there is a big difference between user information like an address, which I display to users and admins, and settings that control the website's behavior for a user.
What should I do?
Create a separate UserSettings table, breaking the rules of normalization because of the big logical difference, OR put the settings as columns in the User table and keep the rules of normalization despite the logical difference?
Complete normalization is rarely the right approach for large complex databases.
Always think through the pros and cons of your models. Consider the following: Complexity, Performance, Maintenance, Evolution.
If your database is part of an evolving system, then you will almost certainly be changing your models (tables) and relationships at some point in the future.
As a rule of thumb, keeping your models close to real life will bring benefits in the long term. Especially when your client / user comes back with a new feature request.
Try to consider modelling your data in different ways. For example: your current 'User' record sounds more like a 'Contact' record. Contact records may have uses other than storing system settings - therefore keeping the two models as separate tables would be the correct solution, even if the relationship does start out as 1:1.
Creating a separate table for user settings does not break the rules of normalization. If it makes sense for other reasons then I suggest you do it.
It's OK to break the tables apart - for example, the RDBMS may well be able to avoid a lot of disk seeks (or reading a lot of unneeded data) when reading rows. It depends on your app and how it queries the data.
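A minimal sketch of the split discussed above: a separate user_settings table in a 1:1 relationship with users, linked by a shared primary key (sqlite3; all column names are illustrative):

```python
# 1:1 split: settings live in their own narrow table, joined back by PK,
# so reads of user contact data never touch the settings columns.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    address  TEXT
);
CREATE TABLE user_settings (
    user_id        INTEGER PRIMARY KEY REFERENCES users(user_id),  -- 1:1 via shared PK
    theme          TEXT DEFAULT 'light',
    emails_enabled INTEGER DEFAULT 1
);
""")
db.execute("INSERT INTO users VALUES (1, 'anna', 'Main St 1')")
db.execute("INSERT INTO user_settings (user_id, theme) VALUES (1, 'dark')")

row = db.execute("""
    SELECT u.username, s.theme
    FROM users u JOIN user_settings s ON s.user_id = u.user_id
""").fetchone()
```

Making user_settings.user_id both the primary key and the foreign key is what enforces the 1:1 cardinality at the schema level.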
Two of my colleagues and I are building a system to do all sorts of hydrology and related work. It has a lot of requirements and a good number of tables.
We are handling all sorts of sampling done within this scope (hydrology), and we are trying to find a less painful way to do it.
Sometimes we need to get all that sampling together and I'm starting to think we are over-complicating our database design.
How or when do you know that you are over-designing a database? We are, of course, following a lot of normal-form rules and other good practices, but when is it OK to drop one of those rules, e.g. not to normalize something?
What are your opinions on this?
Short Answer
You can't, worry about something else.
Long Answer
This sounds like yet another form of premature optimization. (YAFPO?)
You should design your schema using third normal form (3NF). Once designed, you should populate your tables with data and begin profiling.
If a particular query is deemed too costly then you should look into denormalization on a case by case basis.
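The normalize-first, denormalize-per-query flow described above can be sketched as follows - a 3NF pair of tables, then a duplicated column added only for the one query profiling flagged (sqlite3; all names are illustrative):

```python
# Start in 3NF, then denormalize a single hot column on a case-by-case basis.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id));
INSERT INTO customers VALUES (1, 'Acme');
INSERT INTO orders VALUES (10, 1);
""")

# 3NF: the customer name is reachable only through a join.
joined = db.execute("""SELECT c.name FROM orders o
                       JOIN customers c ON c.id = o.customer_id
                       WHERE o.id = 10""").fetchone()[0]

# Targeted denormalization: duplicate the hot column onto orders.
db.execute("ALTER TABLE orders ADD COLUMN customer_name TEXT")
db.execute("""UPDATE orders SET customer_name =
              (SELECT name FROM customers WHERE id = orders.customer_id)""")
direct = db.execute("SELECT customer_name FROM orders WHERE id = 10").fetchone()[0]
```

The price of the duplicated column is that the application (or a trigger) must now keep it in sync with customers.name - which is exactly why this is done per query and not up front.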
Technical Answer (for the nitpickers who will inevitably object to: "you can't")
You will reach a limit at some point based on your choice of RDBMS and/or storage engine. Likely ceilings will be memory consumption or open file descriptors.
"When do you have too many tables?"
At the level of logical design, the correct answer is "never".
At the level of physical design (insofar as "having a table" really refers to some concept that pertains to the physical design), the correct answer is "if and when the queries that you need to do, given the restrictions of the DBMS you are using, are causing performance to be unacceptably low".
We have a system with literally hundreds of tables - it's no big deal; it's just that a lot of different things are stored in the database.
We have a ton of tables in our system as well. What we did was normalize the database to a good point, then created a few views that encompass the most common table usage needs of our system. Something like that could help you as well.
Let's say you are a GM DBA and you have to design around the GM models.
Is it better to do this?
table_model
type {cadillac, saturn, chevrolet}
Or this?
table_cadillac_model
table_saturn_model
table_chevrolet_model
Let's say that the business lines have the same columns for a model and that there are over a million records for each subtype.
EDIT:
there is a lot of CRUD
there are a lot of very processor intensive reports
in either schema, there is a model_detail table that contains 3-5 records for each model and the details for each model differ (you can't add a cadillac detail to a saturn model)
the dev team doesn't have any issues with db complexity
I'm not really sure that this is a normalization question. Even though the structures are the same, they might be thought of as different entities.
EDIT:
Reasons for partitioning the structure into multiple tables
- business lines may have different business rules regarding parts
- addModelDetail() could be different for each business line (even though the data format is the same)
- high add/update activity - better performance with partitioned structure instead of single structure (I'm guessing and not sure here)?
I think this is a variation of the EAV problem. When posed as an EAV design, the single-table structure generally gets voted a bad idea. When posed in this manner, the single-table structure generally gets voted a good idea. Interesting...
I think the most interesting answer is having two different structures - one for CRUD and one for reporting. I think I'll try a concatenated/flattened view for reporting and multiple tables for CRUD and see how that works.
Definitely the former example. Do you want to be adding tables to your database whenever you add a new model to your product range?
On data with a lot of writes, (e.g. an OLTP application), it is better to have more, narrower tables (e.g. tables with fewer fields). There will be less lock contention because you're only writing small amounts of data into different tables.
So, based on the criteria you have described, the table structure I would have is:
Vehicle
VehicleType
Other common fields
CadillacVehicle
Fields specific to a Caddy
SaturnVehicle
Fields specific to a Saturn
For reporting, I'd have an entirely different database on an entirely different server that does not have the normalized structure (e.g. just has CadillacVehicle and SaturnVehicle tables with all of the fields from the Vehicle table duplicated into them).
With proper indexes, even the OLTP database could be performant in your SELECT's, regardless of the fact that there are tens of millions of rows. However, since you mentioned that there are processor-intensive reports, that's why I would have a completely separate reporting database.
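The OLTP layout described above - a common Vehicle table plus one narrow table per brand - can be sketched like this (sqlite3; all field names are illustrative):

```python
# Class-table-inheritance style: shared fields in vehicle, brand-specific
# fields in per-brand tables joined back by the shared primary key.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE vehicle (
    vehicle_id   INTEGER PRIMARY KEY,
    vehicle_type TEXT NOT NULL,        -- 'cadillac', 'saturn', ...
    vin          TEXT
);
CREATE TABLE cadillac_vehicle (
    vehicle_id   INTEGER PRIMARY KEY REFERENCES vehicle(vehicle_id),
    trim_package TEXT                  -- Caddy-specific field
);
INSERT INTO vehicle VALUES (1, 'cadillac', 'VIN001');
INSERT INTO cadillac_vehicle VALUES (1, 'Platinum');
""")

row = db.execute("""
    SELECT v.vin, c.trim_package
    FROM vehicle v JOIN cadillac_vehicle c ON c.vehicle_id = v.vehicle_id
    WHERE v.vehicle_type = 'cadillac'
""").fetchone()
```

Writes to one brand's subtype table don't contend with another brand's, which is the narrow-tables benefit mentioned above; the reporting copy would flatten this join away.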
One last comment. About the business rules... the data store cares not about the business rules. If the business rules are different between models, that really shouldn't factor into your design decisions about the database schema (other than to help dictate which fields are nullable and their data types).
Use the former. Setting up separate tables for the specialisations will complicate your code and doesn't bring any advantages that can't be achieved in other ways. It will also massively simplify your reports.
If the tables really do have the same columns, then the former is the best way to do it. Even if they had different columns, you'd probably still want to have the common columns be in their own table, and store a type designator.
You could try having two separate databases.
One is an OLTP (OnLine Transaction Processing) system, which should be highly normalized so that the data model is highly correct. Report performance is not a concern here; you would deal with non-reporting query performance with indexes/denormalization etc. on a case-by-case basis. The data model should match up very closely with the conceptual model.
The other is a Reports system which should pull data from the OLTP system periodically, and massage and rearrange that data in a way that makes report-generation easier and more performant. The data model should not try to match up too closely with the conceptual model. You should be able to regenerate all the data in the reporting database at any time from the data currently in the main database.
I would say the first way looks better.
Are there reasons you would want to do it the second way?
The first way follows normalization better and is closer to how most relational database schema are developed.
The second way seems to be harder to maintain.
Unless there is a really good reason for doing it the second way I would go with the first method.
Given the description that you have given us, the answer is either.
In other words you haven't given us enough information to give a decent answer. Please describe what kind of queries you expect to perform on the data.
[Having said that, I think the answer is going to be the first one ;-)
As I imagine that even though they are different models, the data for each model is probably going to be quite similar.
But this is a complete guess at the moment.]
Edit:
Given your updated edit, I'd say the first one, definitely. As they all have the same data, they should go into the same table.
Another thing to consider in defining "better"--will end users be querying this data directly? Highly normalized data is difficult for end-users to work with. Of course this can be overcome with views but it's still something to think about as you're finalizing your design.
I do agree with the other two folks who answered: which form is "better" is subjective and dependent on what you're hoping to achieve. If you're hoping to achieve very quick queries that's one thing. If you're hoping to achieve high programmer productivity--that's a different goal again and possibly conflicts with quick queries.
The choice depends on the required performance.
The best database is a normalized database, but there can be performance issues in a normalized database, in which case you have to denormalize it.
The principle "normalize first, denormalize for performance" works well.
It depends on the data model and the use case. If you ever need to run a report that pulls data across the "models", then the former is preferable, because otherwise (with the latter) you'd have to change the query (to include the new table) every time you added a new model.
Oh and by "former" we mean this option:
table_model
* type {cadillac, saturn, chevrolet}
#mson has asked the question "What do you do when a question is not satisfactorily answered on SO?", which is a direct reference to the existing answers to this question.
I contributed the following answer to that discussion, primarily critiquing the way the question was asked.
Quote (verbatim):
I looked at the original question yesterday, and decided not to contribute an answer.
One problem was the use of the term 'model' as in 'GM models' - which cited 'Chevrolet, Saturn, Cadillac' as 'models'. To my understanding, these are not models at all; they are 'brands', though there might also be an industry-insider term for them that I'm not familiar with, such as 'division'. A model would be a 'Saturn Vue' or 'Chevrolet Impala' or 'Cadillac Escalade'. Indeed, there could well be models at a more detailed level than that - different variants of the Saturn Vue, for example.
So, I didn't think that the starting point was well framed. I didn't critique it; it wasn't quite compelling enough, and there were answers coming in, so I let other people try it.
The next problem is that it is not clear what your DBMS is going to be storing as data. If you're storing a million records per 'model' ('brand'), then what sorts of data are you dealing with? Lurking in the background is a different scenario - the real scenario - and your question has used an analogy that failed to be sufficiently realistic. That means that the 'it depends' parts of the answer are far more voluminous than the 'this is how to do it' ones. There is just woefully too little background information on the data to be modelled to allow us to guess what might be best.
Ultimately, it will depend on what uses people have for the data. If the information is going to go flying off in all different directions (different data structures in different brands; different data structures at the car model levels; different structures for the different dealerships - the Chevrolet dealers are handled differently from the Saturn dealers and the Cadillac dealers), then the integrated structure provides limited benefit. If everything is the same all the way down, then the integrated structure provides a lot of benefit.
Are there legal reasons (or benefits) to segregating the data? To what extent are the different brands separate legal entities where shared records could be a liability? Are there privacy issues, such that it will be easier to control access to the data if the data for the separate brands is stored separately?
Without a lot more detail about the scenario being modelled, no-one can give a reliable general answer - at least, not more than the top-voted one already gives (or doesn't give).
Data modelling is not easy.
Data modelling without sufficient information is impossible to do reliably.
I have copied the material here since it is more directly relevant. I do think that to answer this question satisfactorily, a lot more context should be given. And it is possible that there needs to be enough extra context to make SO the wrong place to ask it. SO has its limitations, and one of those is that it cannot deal with questions which require long explanations.
From the SO FAQs page:
What kind of questions can I ask here?
Programming questions, of course! As long as your question is:
detailed and specific
written clearly and simply
of interest to at least one other programmer somewhere
...
What kind of questions should I not ask here?
Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!
This question is, IMO, close to the 'require extended discussion' limit.