I am thinking about creating something like a tree- or networklike guide with Java objects with mapping it into a DB.
Following each step leads to another and so on, the following question/task/whatever depends on the former action. (see picture) It should be possible to create cycles for e. g. repeating some former steps.
What database should I prefer? Standard relational ones, connecting a table maybe with itself (foreign key -> primary key) to connect the nodes or some document-based (graph-based) like OrientDB, creating real trees? What about object-oriented databases like db4o?
What would have the better performance and/or be easier to realize?
Thanks in advance.
Additional thoughts:
I probably would add different actions (calls of webservices, whatsoever) and/or media (text, images, videos) to one node (step), leading to other steps, maybe getting back to a former one and so on.
I think you're on the right track with graph databases. OrientDB looks like it would work really well. Also of course I'll throw out Neo4j, which should work just as good.
A relational database in my opinion is easy to toss out because it doesn't map well to the problem space. You might be able to create good structures to store relationships etc., but the queries will be horrific and complex.
The most important question I would ask is, "What does the query language look like?" Storing data usually isn't that hard to do. You can design structures and load data just fine.
When you need to query and think about your data though ... is the query language easy to reason about? Does it represent the concepts that are important to you?
For graphs, I really like Gremlin. https://github.com/thinkaurelius/titan/wiki/Gremlin-Query-Language
It has a pretty natural syntax for conveying graph concepts. I'd check that out as an alternative to OrientDB.
Related
I'm investigating a new project which will be a social networking style site. I'm reading up on RavenDb and I like the look of a lot of its features. I've not read up on nosql all that much but I'm wondering if there's a niche it fits best with and old school sql is still the best choice for other stuff.
I'm thinking that the permissions plug in would be ideal for a social net style site - but will it really perform in an environment where the database will be getting hammered - or is it optimised for a more reporting style system where it's possible to keep throwing new data structures at the database and report on those structures.
I'm eager to use the right tool for the job - I'll be using MVC3, Windsor + either Nhibernate+Sql server or RavenDb.
Should I stick with the old school sql or go with the new kid on the block: ravendb?
This question can get very close to being subjective (even though it's really not), you're talking about NoSQL as if it is just one thing, and that is not the case.
You have
graph databases (Neo4j etc),
map/reduce style document databases (Couch,Raven),
document databases which attempt to feel like ordinary databases (Mongo),
Key/value stores (Cassandra etc)
moar goes here.
Each of them attempts to solve a different problem via different means, and whether you'd use one of them over a traditional relational store is
A matter of suitability
A matter of personal preference
At the end of the day, for the primary data-storage for a single system, a document database or relational store is probably what you want, although for different parts of your system you may well end up utilising a graph database (For calculating neighbours etc), or a key/value store (like Facebook does/did for inbox messages).
The main benefit of choosing a document store as your primary store over that of a relational one, is that you haven't got to worry about trying to map your objects into a collection of tables, and there is less configuration overhead involved in doing so.
The other downside/upside would be that you have to learn something new and make mistakes along the way.
So my answer if I am going to be direct?
RavenDB would be suitable
SQL would be suitable
Which do you prefer to use? These days I'd probably just go for Raven, knowing I can dump data into a relational store for reporting purposes and probably do likewise for other parts of my system, and getting free-text search and fastish-writes/fast-reads without going through the effort of defining separate read/write stores is an overall win.
But that's me, and I am biased.
I started my first MySQL project designing the ERD, logical and physical diagrams.
A friend of mine is making the same project as me. I started the plan of my databases by making an ERD and then normalizing.
However, he uses a relational database diagrams where he designs interfaces and other parts first before making the ERD. He for example writes "stack" only to the column phonenumbers, instead of making a "help-table".He says that it is best to make the interfaces first and then make the ERD.
Which one of us is doing the plan in your opinion better?
One could, and many people have, written a book on this.
However to generalise what I would generally do is
Analyse your data and reduce it down to 3rd normal form. This should be pretty formulaic to accomplish.
In light of likely business use of the data decide if and where data should be denormalized. Typically most databases will be overwhelmingly in 3rd normal with a few critical exceptions. This part is where experience and craft come in.
In light of the above create any additional indexes that may be necessary, or modify existing primary indexes (which should have been assigned in phase 1).
Create views for user access as necessary. The number you require may vary from none (as in simple embedded application) to many (as in no direct data access to tables allowed).
Create any procedures you need, and possibly triggers (generally best avoided but appropriate for audit purposes).
In practice of course the process is considerably more iterative, but the general design path from data to interface holds true. Also it's a good idea when designing a database to keep in mind that you will want to change it later and try if possible to make that a reasonably straightforward task.
I'm not sure what you mean by "interfaces and operations", but the way you're designing the schemas is right -- doing an ERD and properly normalizing. A lot of people, when they are just starting out, will take shortcuts on the design in order to fit the schema to their current level of querying skills.
For example, instead of creating a table of phone numbers and mapping these phone numbers to a "customers" table, they might just stick in columns called Phone1 Phone2 Phone3... instead. That can be the kiss of death later on when designing your queries.
So my advice... Create the normalized data model with ERDs. Then read up on VIEWs and user-defined functions in order to "flatten" out your schema where necessary for people who wish to query it. Sorry for the general answer, but it's kind of a general question...
At linkedin, when you visit someones profile you can see how you are connected to them. I believe that linkedin shows upto 3rd level connections if not more, something like
shabda -> Foo user, bar user, baz user -> Joel's connection -> Joel
How can I represent this in the database.
If I model as,
User
Id PK
Name Char
Connection
User1 FK
User2 FK
Then to find the network, three levels deep, I need to get all my connection, their connections, and their connections, and then see if the current user is there. This obviously would be very inefficient with DB of any size, and probably clunky to work with as well.
Since, on linked in I can see this network, on any profile I visit, I don't think this is precalculated either.
The other thing which comes to my mind is probably this is best not stored in a relational DB, but then what would be the best way to store and retrieve it?
My recommendation would be to use a graph database. There seems to be only one implementation currently available, and that's Neo4j. It's written in Java, but has bindings to Ruby and Scala (Python in progress).
If you don't know Java, you probably won't be able to find anything similar on any other platform (yet), unfortunately. However, if you do know Java (or are at least willing to learn), it's way worth it. (Technically you don't even need to learn Java because of the Ruby/Python bindings.) Neo4j was built for exactly what you're trying to do. You'd go through a ton of trouble trying to implement this in a relational database, when you'd be able to do the exact same thing in only a few lines of Java code, and also much more efficiently.
If that's not an option, I'd still recommend looking at other database types such as object databases. Relational databases weren't built for this kind of thing, and you'd go through more pain by trying to do it in an RDBMS than by switching to a different kind of database and learning it.
I don't see why there's anything wrong with using a relational database for this. The tables defined in the question are an excellent start. With proper optimization you'll be able to keep your performance well in hand. I personally think you would need something serious to justify shifting away from such a versatile mainstream product. You'll probably need an RBDMS in the project anyway and there are an unmatchable amount of legitimate choices in many price ranges (even free). You'll get quality documentation, support will be available, and you'll have a large supply of highly trained developers available in the job pool.
Regarding this model of self-relationships (users joined to other users), I recommend looking into recursive queries. That will keep you from performing a cascade of individual queries to find 3 levels of relationships. Consider the following SQL Server method for performing recursive queries with CTEs.
http://msdn.microsoft.com/en-us/library/ms186243.aspx
It allows you to specify how deep you want to go with the MAXRECURSION hint.
Next, you need to start thinking of ways to optimize. That starts with standard best-practices for setting up your tables with proper indexes and maintenance, etc. It inevitably ends with denormalization. That's one of those things you only do once you've already tried everything else, but if you know what you're doing and use good practices then your performance gain will be significant. There are many resources on the internet to help you learn about denormalization, just look it up.
I know that the title might sound a little contradictory, but what I'm asking is with regards to ORM frameworks (SQLAlchemy in this case, but I suppose this would apply to any of them) that allow you to define your schema within your application.
Is it better to change the database schema directly and then update the column types in your program manually, or does it make more sense to define the tables in your application and then use the ORM framework's table generation functions to make the schema and then build the tables on the database side for you?
Bear in mind that applications and databases tend to live in a M:M relationship in any but the most trivial cases. If your application is at all likely to have interfaces to other systems, reports, data extracts or loads, or data migrated onto or off it from another system then the database has more than one stakeholder.
Be nice to the other stakeholders in your application. Take the time and get the schema right and put some thought into data quality in the design of your application. Keep an eye on anyone else using the application and make sure you don't break bits of the schema that they depend on without telling them. This means that the database has a life of its own to a greater or lesser extent. The more integration, the more independent the database.
Of course, if nobody else uses or cares about the data, feel free to ignore my advice.
My personal belief is that you should design the database on its own merits. The database is the best place to handle things modeling your Domain data. The database is also the biggest source of slow down in applications and letting your ORM design your database seems like a bad idea to me. :)
Of course, I've only got a couple of big projects behind me. I'm still learning daily. :)
The best way to define your database schema is to start with modeling your application domain (domain driven design anyone?) and seeing what tables take shape based on the domain objects you define.
I think this is the best way because really the database is simply a place to persist information from the application, it should never lead the design. It's not the only place to persist information as well. We have users that want to work from flat files or the database for instance. They could also use XML files. So by starting with your domain objects and then generating tables (or flat file or XML schema or whatever) from there will lead to a much better design in the end.
While this may depend on you using an object-oriented language, using an ORM tool like Hibernate/NHibernate, SubSonic, etc. can really make this transition easy for you up to, and including generating the database creation scripts.
In reference to performance, performance should be one of the last things you look at in an application, it should never drive the design. After you get a good schema up and running based on your domain you can always make tweaks to improve its performance.
Alot depends on your skill level with the specific database product that you're going to use. Think of it as the difference between a "manual" and "automatic" transmission car. ORMs provide you with that "automatic" transmission, just start designing your classes, and let the ORM worry about getting it stored into the database somehow.
Sounds good. The problem with most ORMs is that in their quest to be PI "persistence ignorant", they often don't take advantage of specific database features that can provide elegant solutions for a given task. Notice, I didn't say ALL ORMs, just most.
My take is to design the conceptual data model first yourself. Then you can go in either direction, up towards the application space, or down towards the physical database. But remember, only YOU know if it's more advantageous to use a view instead of a table, should you normalize or de-normalize a table, what non-clustered index(es) make sense with this table, is a natural or surrogate key more appropriate for this table, etc... Of course, if you feel that these questions are beyond your grasp, then let the ORM help you out.
One more thing, you really need to seperate the application design from the database design. They are almost never the same. How important is that data? Could another application be designed to use that data? It's a lot easier to refactor an application than it is to refactor a database with a billion rows of data spread across thousands of tables.
Well, if you can get away with it, doing it in the application is probably the best way. Since it's a perfect example of the DRY principle.
Having said that however, getting away with it is always going to be hard to pull off since you're practically choosing to give up most database specific optimizations. (more so, with querying, but it still applies to schemas (indexes, etc)).
You'll probably end up changing the schema by hand anyway, and then you'll be stuck with a brittle database schema that's going to be the source of your worst nightmares :)
My 2 Cents
Design each based on their own requirements as much as possible. Trying to keep them in too rigid sync is a good illustration of increased coupling/decreased cohesion.
Come to think of it, ORMs can easily be used to spread coupling (even though it can be avoided to some degree).
I've been trying to see if I can accomplish some requirements with a document based database, in this case CouchDB. Two generic requirements:
CRUD of entities with some fields which have unique index on it
ecommerce web app like eBay (better description here).
And I'm begining to think that a Document-based database isn't the best choice to address these requirements. Furthermore, I can't imagine a use for a Document based database (maybe my imagination is too limited).
Can you explain to me if I am asking pears from an elm when I try to use a Document oriented database for these requirements?
You need to think of how you approach the application in a document oriented way. If you simply try to replicate how you would model the problem in an RDBMS then you will fail. There are also different trade-offs that you might want to make. ([ed: not sure how this ties into the argument but:] Remember that CouchDB's design assumes you will have an active cluster of many nodes that could fail at any time. How is your app going to handle one of the database nodes disappearing from under it?)
One way to think about it is to imagine you didn't have any computers, just paper documents. How would you create an efficient business process using bits of paper being passed around? How can you avoid bottlenecks? What if something goes wrong?
Another angle you should think about is eventual consistency, where you will get into a consistent state eventually, but you may be inconsistent for some period of time. This is anathema in RDBMS land, but extremely common in the real world. The canonical transaction example is of transferring money from bank accounts. How does this actually happen in the real world - through a single atomic transactions or through different banks issuing credit and debit notices to each other? What happens when you write a cheque?
So lets look at your examples:
CRUD of entities with some fields with unique index on it.
If I understand this correctly in CouchDB terms, you want to have a collection of documents where some named value is guaranteed to be unique across all those documents? That case isn't generally supportable because documents may be created on different replicas.
So we need to look at the real world problem and see if we can model that. Do you really need them to be unique? Can your application handle multiple docs with the same value? Do you need to assign a unique identifier? Can you do that deterministically? A common scenario where this is required is where you need a unique sequential identifier. This is tough to solve in a replicated environment. In fact if the unique id is required to be strictly sequential with respect to time created it's impossible if you need the id straight away. You need to relax at least one of those constraints.
ecommerce web app like ebay
I'm not sure what to add here as the last comment you made on that post was to say "very useful! thanks". Was there something missing from the approach outlined there that is still causing you a problem? I thought MrKurt's answer was pretty full and I added a little enhancement that would reduce contention.
Is there a need to normalize the data?
Yes: Use relational.
No: Use document.
I am in the same boat, I am loving couchdb at the moment, and I think that the whole functional style is great. But when exactly do we start to use them in ernest for applications. I mean, yes we can all start to develop applications extremely quickly, cruft free with all those nasty hang-ups about normal form being left in the wayside and not using schemas. But, to coin a phrase "we are standing on the shoulders of giants". There is a good reason to use RDBMS and to normalise and to use schemas. My old oracle head is reeling thinking about data without form.
My main wow factor on couchdb is the replication stuff and the versioning system working in tandem.
I have been racking my brain for the last month trying to grok the storage mechanisms of couchdb, apparently it uses B trees but doesn't store data based on normal form. Does this mean that it is really really smart and realises that bits of data are replicated so lets just make a pointer to this B tree entry?
So far I am thinking of xml documents, config files, resource files streamed to base64 strings.
But would I use couchdb for structural data. I don't know, any help greatly appreciated on this.
Might be useful in storing RDF data or even free form text.
A possibility is to have a main relational database that stores definitions of items that can be retrieved by their IDs, and a document database for the descriptions and/or specifications of those items. For example, you could have a relational database with a Products table with the following fields:
ProductID
Description
UnitPrice
LotSize
Specifications
And that Specifications field would actually contain a reference to a document with the technical specifications of the product. This way, you have the best of both worlds.
Document based DBs are best suiting for storing, well, documents. Lotus Notes is a common implementation and Notes email is an example. For what you are describing, eCommerce, CRUD, etc., realtional DBs are better designed for storage and retrieval of data items/elements that are indexed (as opposed to documents).
Re CRUD: the whole REST paradigm maps directly to CRUD (or vice versa). So if you know that you can model your requirements with resources (identifiable via URIs) and a basic set of operations (namely CRUD), you may be very near to a REST-based system, which quite a few document-oriented systems provide out of the box.