Database suggestions (and possible readings) for a computation-heavy website

I'm building a website that will rely on heavy computation to make guesses and suggestions about objects (taking into account the user's preferences and those of users with similar profiles). Right now I'm using MongoDB for my projects, but I suppose I'll have to go back to SQL for this one.
Unfortunately my knowledge of the subject is at high-school level. I know that there are a lot of relational databases, and I was wondering which would be the most appropriate for this kind of heavily dynamic cluster analysis. I would also really appreciate some suggestions for possible readings (free and online would be really nice, but I don't mind reading a book; just maybe not a 1,000-page one if possible).
Thanks for your help, extremely appreciated.

Recommendations are typically a graph-like problem, so you should also consider looking into graph databases, e.g. Neo4j.
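To make the contrast concrete, here is roughly what one "users with similar profiles" lookup looks like in plain SQL over an adjacency-style schema (a minimal sketch only; the likes table, its columns, and the user id 42 are all hypothetical):

    -- Hypothetical schema: likes(user_id, item_id) records which items each user liked.
    -- Recommend items liked by users who share at least one liked item with user 42,
    -- excluding items user 42 already likes, ranked by how often they co-occur.
    SELECT other.item_id, COUNT(*) AS score
    FROM likes AS mine
    JOIN likes AS peers ON peers.item_id = mine.item_id
                       AND peers.user_id <> mine.user_id
    JOIN likes AS other ON other.user_id = peers.user_id
    WHERE mine.user_id = 42
      AND other.item_id NOT IN (SELECT item_id FROM likes WHERE user_id = 42)
    GROUP BY other.item_id
    ORDER BY score DESC;

Each additional hop of "similar to similar" is another self-join, which is exactly the kind of traversal a graph database expresses more naturally.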

Related

Database design - How does Salesforce store dynamically created fields on the fly?

I am wondering how Salesforce designs their database to allow users to create dynamic fields.
I am looking to achieve this as well and would be interested to know if anyone has any idea how they did it.
That's not a good question for Stack Overflow. Hardly programming-related.
I'd say check if you really need to reinvent the wheel. If you want freestyle data storage, perhaps one of the existing NoSQL solutions will fit your needs. You have probably heard of MongoDB or BigTable (or whatever it is the hip kids use these days because "normal" relational databases are boring). You don't want to design by hand tables with placeholder columns of each appropriate type, then tables that hold metadata for these (per client), then possibly some indexes on top of that...
This stuff is complex. That's what Salesforce clients pay for: not having to worry about solving such problems (or about the hardware used, or scalability).
There must be some serious black magic going on behind the scenes when a typical SF record takes ~2 KB of storage space regardless of how complex it is (how many fields are used in the table). Don't expect to replicate their solution overnight.
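For a sense of what that hand-rolled route would involve, a "placeholder columns plus metadata" layout might look something like the sketch below. This is purely illustrative: every name in it is made up, and it is not how Salesforce actually stores data (the architecture links below describe their real design).

    -- Generic data table: a fixed set of typed placeholder columns per record.
    CREATE TABLE custom_record (
        record_id   BIGINT PRIMARY KEY,
        org_id      BIGINT NOT NULL,        -- which client/tenant owns the row
        object_name VARCHAR(100) NOT NULL,  -- which "virtual" object the row belongs to
        string_01   VARCHAR(255),
        string_02   VARCHAR(255),
        number_01   DECIMAL(18,4),
        number_02   DECIMAL(18,4),
        date_01     DATE,
        date_02     DATE
    );

    -- Metadata table: maps each client's custom field onto one of the placeholders.
    CREATE TABLE field_definition (
        org_id      BIGINT NOT NULL,
        object_name VARCHAR(100) NOT NULL,
        field_label VARCHAR(100) NOT NULL,  -- e.g. 'Invoice Total'
        storage_col VARCHAR(30)  NOT NULL,  -- e.g. 'number_01'
        PRIMARY KEY (org_id, object_name, field_label)
    );

Every query then has to consult the metadata to know which placeholder column means what, which is where the complexity (and the need for careful indexing) really starts.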
But if you really want to try to dive into it, these might be a good start:
Force.com architecture overview
A review(?) of the architecture (the whitepaper from 2008 that the article talks about is a dead link; this is probably the right one)
Yet another article
No idea if these will be terribly helpful... but at least you'll end up with a list of keywords to go on?

Basic Database Question?

I am interested to know a little bit more about databases than I currently do. I know how to set up a database backend for any web app that I happen to be creating, but that is all. For example, if I was creating three different apps I would simply create three different databases and then configure each database for the particular app. This is all simple knowledge, and I would now like to have a deeper understanding of how databases actually work.
Let's say that I developed an application that needed a lot of space and processing power. The database would then have to be spread over numerous machines. How exactly would a database be spread across numerous machines and still be able to write records and then retrieve them? Would each table get its own machine, and what software is needed to make sure that the different machines have all performed their transactions successfully?
As you can see, I am quite a database ignoramus, lol.
Any help in clearing this up would be greatly appreciated.
I don't know what RDBMS you're using, but I have two book suggestions.
For theory (which should come first, in my opinion): Database in Depth: Relational Theory for Practitioners
For implementation: High Performance MySQL: Optimization, Backups, Replication, and More
I own both these books and they are both pretty great, especially the first one.
That's quite a broad topic... You might want to start with Multi-master replication, High-availability clustering and Massively parallel processing.
If you want to know how to keep databases running under ever-increasing load, then it's not a basic question. Several well-known web companies are struggling to find the right way to make their databases scalable.
Using memcached to cache database information is one way to decrease the load on your database if your application is read-intensive. If your application is write-intensive, then you may want to consider using a NoSQL datastore like MongoDB or Redis.
Database Design for Mere Mortals
This is the best book on the subject if you don't have any experience with databases. It has historical background and practical examples. Most books skip the historical stuff because they assume you know what a DB is, or that it doesn't matter, and jump right to the practical. This book gives you the complete picture.

How to model this [networks, details in post] in a database for efficiency and ease of use?

At LinkedIn, when you visit someone's profile you can see how you are connected to them. I believe that LinkedIn shows up to 3rd-level connections, if not more, something like
shabda -> Foo user, bar user, baz user -> Joel's connection -> Joel
How can I represent this in the database?
If I model it as:
    User
        Id    PK
        Name  Char

    Connection
        User1 FK
        User2 FK
Then to find the network three levels deep, I need to get all my connections, their connections, and their connections, and then see if the current user is there; something like the query sketched below. This would obviously be very inefficient with a DB of any size, and probably clunky to work with as well.
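As a rough illustration of that naive approach (a sketch only, assuming every connection row is stored in both directions, with @me as a placeholder for the current user's id):

    -- All users exactly three hops out from @me; the one- and two-hop levels
    -- each need their own, shorter version of this query.
    SELECT DISTINCT c3.User2 AS third_level_user
    FROM Connection AS c1
    JOIN Connection AS c2 ON c2.User1 = c1.User2
    JOIN Connection AS c3 ON c3.User1 = c2.User2
    WHERE c1.User1 = @me;

Checking whether a particular profile appears in the network is then just one more predicate on c3.User2, but every additional level costs yet another self-join over a potentially huge table.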
Since on LinkedIn I can see this network on any profile I visit, I don't think it is precalculated either.
The other thing that comes to mind is that this is probably best not stored in a relational DB, but then what would be the best way to store and retrieve it?
My recommendation would be to use a graph database. There seems to be only one implementation currently available, and that's Neo4j. It's written in Java, but has bindings to Ruby and Scala (Python in progress).
If you don't know Java, you probably won't be able to find anything similar on any other platform (yet), unfortunately. However, if you do know Java (or are at least willing to learn), it's well worth it. (Technically you don't even need to learn Java because of the Ruby/Python bindings.) Neo4j was built for exactly what you're trying to do. You'd go through a ton of trouble trying to implement this in a relational database, when you could do the exact same thing in only a few lines of Java code, and much more efficiently.
If that's not an option, I'd still recommend looking at other database types such as object databases. Relational databases weren't built for this kind of thing, and you'd go through more pain by trying to do it in an RDBMS than by switching to a different kind of database and learning it.
I don't see why there's anything wrong with using a relational database for this. The tables defined in the question are an excellent start. With proper optimization you'll be able to keep your performance well in hand. I personally think you would need something serious to justify shifting away from such a versatile mainstream product. You'll probably need an RDBMS in the project anyway, and there is an unmatched number of legitimate choices in many price ranges (even free). You'll get quality documentation, support will be available, and you'll have a large supply of highly trained developers available in the job pool.
Regarding this model of self-relationships (users joined to other users), I recommend looking into recursive queries. That will keep you from performing a cascade of individual queries to find three levels of relationships. Consider the following SQL Server method for performing recursive queries with CTEs:
http://msdn.microsoft.com/en-us/library/ms186243.aspx
It allows you to specify how deep you want to go with the MAXRECURSION hint.
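As a rough sketch of what that looks like against the question's schema (SQL Server syntax, with @start_user as a placeholder, again assuming connections are stored in both directions):

    -- Walk the Connection table up to three hops out from @start_user.
    WITH network (UserId, Depth) AS (
        SELECT c.User2, 1
        FROM Connection AS c
        WHERE c.User1 = @start_user
        UNION ALL
        SELECT c.User2, n.Depth + 1
        FROM Connection AS c
        JOIN network AS n ON c.User1 = n.UserId
        WHERE n.Depth < 3            -- stop expanding after the third level
    )
    SELECT UserId, MIN(Depth) AS Depth
    FROM network
    GROUP BY UserId
    OPTION (MAXRECURSION 10);        -- generous engine-level safety net

Here the Depth predicate is what actually limits the walk; the MAXRECURSION hint acts as the safety net that aborts the query if the recursion runs away.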
Next, you need to start thinking of ways to optimize. That starts with standard best practices for setting up your tables with proper indexes, maintenance, and so on. It inevitably ends with denormalization. That's one of those things you only do once you've already tried everything else, but if you know what you're doing and use good practices, the performance gain will be significant. There are many resources on the internet to help you learn about denormalization; just look it up.
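One denormalization in that spirit (a hypothetical sketch, not something the answer prescribes) is to precompute the network out to three hops and refresh it in the background, trading storage and write-time work for a cheap read on every profile view:

    -- Precomputed reachability table: one row per (viewer, reachable user) pair.
    CREATE TABLE connection_closure (
        UserId          BIGINT  NOT NULL,
        ReachableUserId BIGINT  NOT NULL,
        Hops            TINYINT NOT NULL,   -- 1, 2 or 3
        PRIMARY KEY (UserId, ReachableUserId)
    );

    -- The profile-page check becomes a single indexed lookup.
    SELECT Hops
    FROM connection_closure
    WHERE UserId = @viewer
      AND ReachableUserId = @profile_owner;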

What kind of database would be suited to maintain a relatively large list of items with a counter for each item that is updated often in real time?

Let's pretend it's for word frequency counts in a web crawler. Is relational the way to go (I'm imagining a simple two-column table) or is there a NoSQL option better suited to this task?
When I say better, I mean more conceptually suited to the task. I'm not really concerned with scalability, just simplicity and an obvious conceptual mapping to the task at hand. In the way that, for me at least, CouchDB maps much more sensibly to a blog than MySQL does.
If this is something that you'll only run on one machine, I'd just use an internal data structure (a red-black tree, or perhaps a trie) for something as simple and small as this.
Or I'd embed a key/value pair database such as BerkeleyDB.
First, are you looking for a paid database or a free one?
If you have one or two tables and maybe one or two indexes, and no need for enterprise features (clustering/replication/flashbacks), then any DB will do fine, from MySQL (the free one) to SQL Express (still free), to SQL Server and Oracle, which are both commercial and cost money.
You need to understand that the hardware might play a role here too, as the schema looks very simple and there wouldn't be much optimization possible; but then again, I don't know your exact needs...
(On the other hand, if you're talking about extremely large tables with a lot of reads and writes, and I mean A LOT, you might need a configuration of two active nodes and other advanced features that might push you towards paid databases.)
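Whatever the engine, the relational version really is the simple two-column table the question imagines, plus an atomic increment. A minimal sketch, assuming MySQL (whose INSERT ... ON DUPLICATE KEY UPDATE handles the upsert):

    -- One row per word, with its running count.
    CREATE TABLE word_count (
        word VARCHAR(100) PRIMARY KEY,
        freq BIGINT NOT NULL DEFAULT 0
    );

    -- Each time the crawler sees a word, insert it or bump its counter atomically.
    INSERT INTO word_count (word, freq)
    VALUES ('database', 1)
    ON DUPLICATE KEY UPDATE freq = freq + 1;

In a key/value store such as Redis the same thing is a single INCR on the word's key, which is arguably the more direct conceptual fit the question is after.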
Have a look at the papers and ideas behind Google BigTable (and the MapReduce operations possible on it). There are other implementations that think in that box; you're really implementing a distributed hash table, which should give you some keywords to throw at Google.

Is any group or foundation developing an algorithm for better storing massive amounts of data?

I've looked at several approaches to enterprise architecture for databases that store massive amounts of data, and it usually comes down to more hardware, database sharding, and storing JSON objects. Has any group been doing research, or does anyone have a more dynamic approach that processes the available data and tells you how to better store it, and then instructs you how to retrieve it given the new method of storage? I know it sounds a bit fanciful, but I figured I would ask anyway.
You might find this interesting:
http://en.wikipedia.org/wiki/BigTable
Very interesting question. It seems to me like the Semantic Web folks may have to deal with this issue before too long. It also seems to me that they've got some technologies that might provide at least part of the solution. Have a look at the OWL specs, for instance.

Resources