How to think in data stores instead of databases?

As an example, Google App Engine uses Google Datastore, not a standard database, to store data. Does anybody have any tips for using Google Datastore instead of databases? It seems I've trained my mind to think 100% in object relationships that map directly to table structures, and now it's hard to see anything differently. I can understand some of the benefits of Google Datastore (e.g. performance and the ability to distribute data), but some good database functionality is sacrificed (e.g. joins).
Does anybody who has worked with Google Datastore or Bigtable have any good advice for working with them?

There are two main things to get used to about the App Engine datastore compared to 'traditional' relational databases:
The datastore makes no distinction between inserts and updates. When you call put() on an entity, that entity gets stored to the datastore with its unique key, and any existing entity with that key gets overwritten. Basically, each entity kind in the datastore acts like an enormous map or sorted list.
Querying, as you alluded to, is much more limited. No joins, for a start.
The key thing to realise - and the reason behind both these differences - is that Bigtable basically acts like an enormous ordered dictionary. Thus, a put operation just sets the value for a given key - regardless of any previous value for that key, and fetch operations are limited to fetching single keys or contiguous ranges of keys. More sophisticated queries are made possible with indexes, which are basically just tables of their own, allowing you to implement more complex queries as scans on contiguous ranges.
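To make that concrete, here is a minimal sketch with the old Python db API (the Counter kind and its key name are made up for illustration):

from google.appengine.ext import db

class Counter(db.Model):                       # hypothetical kind, for illustration only
    count = db.IntegerProperty(default=0)

# put() is an upsert: writing an entity whose key already exists overwrites it
db.put(Counter(key_name='hits', count=1))
db.put(Counter(key_name='hits', count=2))      # silently replaces the first entity

# fetching by key is the cheap, primitive operation...
print(Counter.get_by_key_name('hits').count)   # -> 2

# ...while a filter query is really a scan over a contiguous range of an index
recent = Counter.all().filter('count >=', 2).fetch(10)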
Once you've absorbed that, you have the basic knowledge needed to understand the capabilities and limitations of the datastore. Restrictions that may have seemed arbitrary probably make more sense.
The key thing here is that although these are restrictions over what you can do in a relational database, these same restrictions are what make it practical to scale up to the sort of magnitude that Bigtable is designed to handle. You simply can't execute the sort of query that looks good on paper but is atrociously slow in an SQL database.
In terms of how to change how you represent data, the most important thing is precalculation. Instead of doing joins at query time, precalculate data and store it in the datastore wherever possible. If you want to pick a random record, generate a random number and store it with each record. There's a whole cookbook of tips and tricks like this here.
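For example, the random-record trick could look roughly like this with the old db API (the Quote kind and its rand property are invented for illustration):

import random
from google.appengine.ext import db

class Quote(db.Model):                 # hypothetical kind
    text = db.StringProperty()
    rand = db.FloatProperty()          # precalculated at write time

def add_quote(text):
    Quote(text=text, rand=random.random()).put()

def random_quote():
    r = random.random()
    # a single index scan instead of an ORDER BY RAND()-style query
    q = Quote.all().filter('rand >=', r).order('rand').get()
    return q or Quote.all().order('rand').get()   # wrap around if r landed past the last value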

The way I have been going about the mind switch is to forget about the database altogether.
In the relational DB world you always have to worry about data normalization and your table structure. Ditch it all. Just lay out your web pages. Lay them all out. Now look at them. You're already 2/3 there.
If you forget the notion that database size matters and that data shouldn't be duplicated, then you're 3/4 there and you didn't even have to write any code! Let your views dictate your Models. You don't have to take your objects and make them two-dimensional anymore as in the relational world. You can store objects with shape now.
Yes, this is a simplified explanation of the ordeal, but it helped me forget about databases and just make an application. I have made 4 App Engine apps so far using this philosophy and there are more to come.

I always chuckle when people come out with "it's not relational". I've written cellectr in Django and here's a snippet of my model below. As you'll see, I have leagues that are managed or coached by users. From a league I can get all the managers, or from a given user I can return the leagues she coaches or manages.
Just because there's no specific foreign key support doesn't mean you can't have a database model with relationships.
My two pence.
class League(BaseModel):
    name = db.StringProperty()
    managers = db.ListProperty(db.Key)  # all the users who can view/edit this league
    coaches = db.ListProperty(db.Key)   # all the users who are able to view this league

    def get_managers(self):
        # This returns the models themselves, not just the keys that are stored in teams
        return UserPrefs.get(self.managers)

    def get_coaches(self):
        # This returns the models themselves, not just the keys that are stored in teams
        return UserPrefs.get(self.coaches)

    def __str__(self):
        return self.name

    # Need to delete all the associated games, teams and players
    def delete(self):
        for player in self.leagues_players:
            player.delete()
        for game in self.leagues_games:
            game.delete()
        for team in self.leagues_teams:
            team.delete()
        super(League, self).delete()
class UserPrefs(db.Model):
    user = db.UserProperty()
    league_ref = db.ReferenceProperty(reference_class=League,
                                      collection_name='users')  # league the users are managing

    def __str__(self):
        return self.user.nickname()

    # many-to-many relationship: a user can coach many leagues, and a league can be
    # coached by many users
    @property
    def managing(self):
        return League.gql('WHERE managers = :1', self.key())

    @property
    def coaching(self):
        return League.gql('WHERE coaches = :1', self.key())

    # remove all references to me when I'm deleted
    def delete(self):
        for league in self.managing:
            league.managers.remove(self.key())
            league.put()
        for league in self.coaching:
            league.coaches.remove(self.key())
            league.put()
        super(UserPrefs, self).delete()

I came from the relational database world and then I found this Datastore thing. It took several days to get the hang of it. Here are some of my findings.
You probably already know that the Datastore is built to scale, and that is the thing that separates it from an RDBMS. To scale better with large datasets, App Engine has made some changes (some meaning a lot of changes).
RDBMS vs. Datastore
Structure
In a database we usually structure our data in tables and rows; in the Datastore this becomes kinds and entities.
Relations
In an RDBMS most people follow the one-to-one, many-to-one and many-to-many relationships. The Datastore has that "no joins" thing, but we can still achieve normalization using ReferenceProperty, e.g. a one-to-one relationship.
Indexes
Usually in an RDBMS we create indexes like primary keys, foreign keys, unique keys and index keys to speed up searches and boost database performance. In the Datastore you have at least one index per kind (it will be generated automatically whether you like it or not), because the Datastore searches your entities on the basis of these indexes, and believe me, that is the best part: in an RDBMS you can search on a non-indexed field (it will take some time, but it will work), while in the Datastore you cannot search on a non-indexed property.
Count
In an RDBMS it is easy to do a count(*), but in the Datastore, please don't even think about it the normal way (yes, there is a count() function): it has a 1,000 limit and it costs one small operation per entity counted, which is not good. But we always have good choices; we can use shard counters (see the sketch below).
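A sketch of the shard-counter idea (roughly the pattern from the App Engine articles; the kind name and shard count here are illustrative):

import random
from google.appengine.ext import db

NUM_SHARDS = 20

class CounterShard(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)

def increment(name):
    index = random.randint(0, NUM_SHARDS - 1)
    def txn():
        shard_key = '%s-%d' % (name, index)
        shard = CounterShard.get_by_key_name(shard_key)
        if shard is None:
            shard = CounterShard(key_name=shard_key, name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(name):
    # one query that sums a handful of shard entities instead of counting every row
    return sum(s.count for s in CounterShard.all().filter('name =', name))

Because writes are spread across the shards, no single entity becomes a write hotspot.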
Unique Constraints
In an RDBMS we love this feature, right? But the Datastore has its own way: you cannot define a property as unique :(.
Query
GAE's Datastore provides a query feature much LIKE (oh no, the Datastore has no LIKE keyword) SQL, which is GQL.
Data Insert/Update/Delete/Select
This is where we are all interested. Just as an RDBMS requires one query each for insert, update, delete and select, the Datastore has put, delete and get (don't get too excited), because the Datastore bills puts and gets in terms of writes, reads and small operations (read costs for Datastore calls), and that's where data modeling comes into action. You have to minimize these operations to keep your app running affordably. To reduce read operations you can use memcache, as in the sketch below.
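For example, a read-through cache could look roughly like this (reusing the UserPrefs kind from the earlier answer purely for illustration):

from google.appengine.api import memcache
from google.appengine.ext import db

def get_user_prefs(key_name):
    cache_key = 'userprefs:%s' % key_name
    prefs = memcache.get(cache_key)
    if prefs is None:
        prefs = UserPrefs.get_by_key_name(key_name)   # one datastore read
        if prefs is not None:
            memcache.add(cache_key, prefs, time=300)  # cache for five minutes
    return prefs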

Take a look at the Objectify documentation. The first comment at the bottom of the page says:
"Nice, although you wrote this to describe Objectify, it is also one of the most concise explanation of appengine datastore itself I've ever read. Thank you."
https://github.com/objectify/objectify/wiki/Concepts

If you're used to thinking about ORM-mapped entities then that's basically how an entity-based datastore like Google's App Engine works. For something like joins, you can look at reference properties. You don't really need to be concerned about whether it uses BigTable for the backend or something else since the backend is abstracted by the GQL and Datastore API interfaces.

The way I look at the Datastore is: a kind identifies a table, per se, and an entity is an individual row within that table. If Google were to take out kinds, it would just be one big table with no structure, and you could dump whatever you want into an entity. In other words, if entities are not tied to a kind, you can pretty much give an entity any structure and store everything in one location (kind of like a big file with no structure to it, where each line has a structure of its own).
Now, back to the original comment: Google Datastore and Bigtable are two different things, so do not confuse Google Datastore with a data store in the generic data-storage sense. Bigtable is more expensive than BigQuery (the primary reason we didn't go with it). BigQuery has proper joins and an RDBMS-like SQL language, and it's cheaper, so why not use BigQuery? That being said, BigQuery does have some limitations; depending on the size of your data you might or might not encounter them.
Also, instead of thinking in terms of the Datastore, I think the proper statement would have been "thinking in terms of NoSQL databases". There are plenty of them available out there these days, but when it comes to Google products, except for Google Cloud SQL (which is MySQL) everything else is NoSQL.

Being rooted in the database world, a data store to me would be a giant table (hence the name "Bigtable"). Bigtable is a bad example, though, because it does a lot of other things that a typical database might not do, and yet it is still a database. Chances are, unless you know you need to build something like Google's Bigtable, you will probably be fine with a standard database. They need it because they are handling insane amounts of data and systems together, and no commercially available system can really do the job exactly the way they need it done.
(bigtable reference: http://en.wikipedia.org/wiki/BigTable)

Related

Using Doctrine 2 for large data models

I have a legacy in-house human resources web app that I'd like to rebuild using more modern technologies. Doctrine 2 is looking good. But I've not been able to find articles or documentation on how best to organise the Entities for a large-ish database (120 tables). Can you help?
My main problem is the Person table (of course! it's an HR system!). It currently has 70 columns. I want to refactor that to extract several subsets into one-to-one sub tables, which will leave me with about 30 columns. There are about 50 other supporting one-to-many tables called person_address, person_medical, person_status, person_travel, person_education, person_profession etc. More will be added later.
If I put all the doctrine associations (http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/working-with-associations.html) in the Person entity class along with the set/get/add/remove methods for each, along with the original 30 columns and their methods, and some supporting utility functions then the Person entity is going to be 1000+ lines long and a nightmare to test.
FWIW I plan to create a PersonRepository to handle the common bulk queries, a PersonProfessionRepository for the bulk queries/reports on that sub-table, etc., and Person*Services which will contain some of the more complex business logic where needed. So organising the rest of the app logic is fine: this is a question about how to correctly organise lots of sub-table Entities with Doctrine that all have relationships/associations back to one primary table. How do I avoid bloating the Person entity class?
Identifying types of objects
It sounds like you have a nicely normalized database and I suggest you keep it that way. Removing columns from the people table to create separate tables for one-to-one relations isn't going to help in performance nor maintainability.
The fact that you recognize several groups of properties in the Person entity might indicate you have found cases for a Value Object. Even some of the one-to-many tables (like person_address) sound more like Value Objects than Entities.
Starting with Doctrine 2.5 (which is not yet stable at the time of this writing) it will support embedding single Value Objects. Unfortunately we will have to wait for a future version for support of collections of Value objects.
Putting that aside, you can mimic embedding Value Objects; Ross Tuck has blogged about this.
Lasagna Code
Your plan of implementing an entity, repository, service (and maybe controller?) for Person, PersonProfession, etc sounds like a road to Lasagna Code.
Without extensive knowledge about your domain, I'd say you want to have an aggregate Person, of which the Person entity is the aggregate root. That aggregate needs a single repository. (But maybe I'm off here and being simplistic, as I said, I don't know your domain.)
Creating a service for Person (and other entities / value objects) indicates data-minded thinking. For services it's better to think of behavior. Think of what kind of tasks you want to perform, and group coherent sets of tasks into services. I suspect that for an HR system you'll end up with many services that revolve around your Person aggregate.
Is Doctrine 2 suitable?
I would say: yes. Doctrine itself has no problems with large amounts of tables and large amounts of columns. But performance highly depends on how you use it.
OLTP vs OLAP
For OLTP systems an ORM can be very helpful. OLTP involves many short transactions, writing a single (or short list) of aggregates to the database.
For OLAP systems an ORM is not suited. OLAP involves many complex analytical queries, usually resulting in large object graphs. For these kinds of operations, native SQL is much more convenient.
Even in case of OLAP systems Doctrine 2 can be of help:
You can use DQL queries (instead of native SQL) to use the power of your mapping metadata. Then use scalar or array hydration to fetch the data.
Doctrine also supports arbitrary joins, which means you can join entities that are not associated with each other according to the mapping metadata.
And you can make use of the NativeQuery object with which you can map the results to whatever you want.
I think an HR system is a perfect example of where you have both OLTP and OLAP. OLTP when it comes to adding a new Person to the system, for example; OLAP when it comes to various reports and analytics.
So there's nothing wrong with using an ORM for transactional operations, while using plain SQL for analytical operations.
Choose wisely
I think the key is to carefully choose when to use what, on a case by case basis.
Hydrating entities is great for transactional operations. Make use of lazy loading associations which can prevent fetching data you're not going to use. But also choose to eager load certain associations (using DQL) where it makes sense.
Use scalar or array hydration when working with large data sets. Data sets usually grow where you're doing analytical operations, where you don't really need full blown entities anyway.
@Quicker makes a valid point by saying you can create specialized View objects. You can fetch only the data you need in specific cases and manually mold that data into objects. This goes along with his point not to bloat the user interface with options a user with a certain role doesn't need.
A technique you might want to look into is Command Query Responsibility Segregation (CQRS).
I understand that you have a fully normalized persons table and are now asking how best to denormalize it.
As long as you do not hit any technical constraints (such as a maximum of 64 KB per row), I find 70 columns definitely not overloaded for a persons table in an HR system. Do yourself a favour and don't segment that information, for the following reasons:
selects potentially become more complex
each extracted table needs extra indexes, which increases your overall memory utilization -> this sounds like a minor issue since disk is cheap; however, keep in mind that with caching, the RAM-to-disk-space utilization ratio determines your performance to a huge extent
changes become more complex, as extra relations demand extra care
since any edit/update/read view can be restricted to deal with only slices of your physical data from the tables, no "cosmetic" pressure arises from the end-user (or even admin) perspective
In summary, the table subsetting causes lots of issues and effort but adds little if any value.
Btw, databases are optimized for data storage. Millions of rows and a few dozen columns are no-brainers on that front.

Is polyglot persistence with a graph database for relationships a good idea?

I would like to know whether it is worth using a graph database specifically for relationships.
I intend to use a relational database for storing entities like "User", "Page", "Comment", "Post" etc.
But in most cases of a typical social-graph workload, I have to do deep traversals that relational databases are not good at and that involve slow joins.
Example: Comment -(made_in)-> Post -(made_in)-> Page etc...
I'm thinking of doing something like this:
Example:
User id: 1
Query: Get all followers of user_id 1
Query Neo4j for all outgoing edges named "follows" for the user node with id 1
With the list of ids, query them in the users table:
SELECT *
FROM users
WHERE user_id IN (ids)
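In code, that two-step lookup might look roughly like this sketch (using the official neo4j Python driver and mysql-connector; the connection details, labels and property names are assumptions, not part of my actual schema):

from neo4j import GraphDatabase
import mysql.connector

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
mysql_conn = mysql.connector.connect(user='app', database='appdb')

def followers_of(user_id):
    # step 1: traverse the 'follows' edges in the graph
    with driver.session() as session:
        result = session.run(
            'MATCH (u:User {user_id: $uid})<-[:follows]-(f:User) RETURN f.user_id AS fid',
            uid=user_id)
        ids = [record['fid'] for record in result]
    if not ids:
        return []
    # step 2: hydrate the entities from the relational store
    cursor = mysql_conn.cursor(dictionary=True)
    placeholders = ','.join(['%s'] * len(ids))
    cursor.execute('SELECT * FROM users WHERE user_id IN (%s)' % placeholders, ids)
    return cursor.fetchall()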
Is this slow?
I have seen this question Is it a good idea to use MySQL and Neo4j together?, but still cannot understand why the correct answer says that that is not a good idea.
Thanks
Using Neo4j is a great choice of technology for an application like yours that requires deep traversals. The reason it's a good choice is twofold: first, the Cypher language makes such queries very easy; second, deep traversals happen very quickly because of the way the data is structured in the database.
In order to reap both of these benefits, you will want to have both the relationships and the people (as nodes) in the graph. Then you'll be able to do a friend-of-friends query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof
and a friend-of-friend-of-friend query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->()-[:friend]->fofof
RETURN john, fofof
...and so on. (Same idea for posts and comments, just replace the name.)
Using Neo4j alongside MySQL is fine, but I wouldn't do it in this particular way, because the code will be much more complex, and you'll lose too much time hopping between Neo4j and MySQL.
Best of luck!
Philip
In general, the more databases/systems/layers you've got, the more complex the overall setup and operating will be.
Think about all those tasks like synchronization, export/import, backup/archive etc. which become quite expensive if your database(s) grow in size.
People use polyglot persistence only if the benefits of having dedicated and specialized databases outweigh the drawbacks of having to cope with multiple data stores. For example, this can be the case if you have a large number of data items (activity or transaction logs, say), each related to a user. It would probably make no sense to store all the information in a graph database if you're only interested in the connections between the data items. So you would be better off storing only the relations in the graph (with the nodes holding just a pointer into the other database), and the data per item in a K/V store or the like.
For your example use case, I would go with only one database, namely Neo4j, because it's a graph problem.
As the other answers indicate, using Neo4j as your single data store is preferable. However, in some cases there might not be much choice in the matter, when you already have another database behind your product. I would just like to add that if this is the case, running Neo4j as your secondary database does work (the product I work on operates in this mode). You do have to work extra hard at figuring out what functionality you expect out of Neo4j, what kind of data you need for it, how to keep the data in sync, and the consequences of not always having real-time results. Most of our use cases can work with near real-time results, so we are fine. But it may not be the case for your product. Still, to me, using Neo4j in this mode is preferable to running without it.
We are able to produce a lot of great graphy stuff as a result of it.

SimpleDB Select VS DynamoDB Scan

I am making a mobile iOS app. A user can create an account, and upload strings. It will be like twitter, you can follow people, have profile pictures etc. I cannot estimate the user base, but if the app takes off, the total dataset may be fairly large.
I am storing the actual objects on Amazon S3 and the keys in a database, since listing Amazon S3 keys is slow. So which would be better for storing keys?
This is my knowledge of SimpleDB and DynamoDB:
SimpleDB:
Cheap
Performs well
Designed for small/medium datasets
Can query using select expressions
DynamoDB:
Costly
Extremely scalable
Performs great; millisecond response
Cannot query
To my understanding these points are correct: DynamoDB is more about killer speed and scalability, SimpleDB is more about querying and price (while still delivering good performance). But if you look at it this way, which will be faster: downloading ALL keys from DynamoDB, or doing a select query with SimpleDB... hard, right? One is using a blazing-fast database to download a lot (and then we have to match them), and the other is using a reasonably good-performing database to query and download just the few correct objects. So, which is faster:
DynamoDB downloading everything and matching OR SimpleDB querying and downloading that
(NOTE: matching just means using -rangeOfString and string comparison; nothing power-consuming, time-inefficient, or server-side)
My S3 keys will use this format for every type of object
accountUsername:typeOfObject:randomGeneratedKey
E.g. If you are referencing to an account object
Rohan:Account:shd83SHD93028rF
Or a profile picture:
Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh
I have the randomly generated key for uniqueness; it does not refer to anything, it is simply there so that keys are not repeated and two objects never overlap.
In my app, I can either choose SimpleDB or DynamoDB, so here are the two options:
Use SimpleDB, store keys with the format but not use the format for any reference, instead use attributes stored with SimpleDB. So, I store the key with attributes like username, type and maybe others I would also have to include in the key format. So if I want to get the account object from user 'Rohan'. I just use SimpleDB Select to query the attribute 'username' and the attribute 'type'. (where I match for 'account')
Use DynamoDB, storing keys where each key has the format illustrated above. I scan the whole database, returning every single key. Then, taking advantage of the key format, I can use -rangeOfString to match the ones I want and download them from S3.
Also, SimpleDB is apparently geographically distributed; how can I enable that, though?
So which is quicker and more reliable? Using SimpleDB to query keys with attributes. Or using DynamoDB to store all keys, scan (download all keys) and match using e.g. -rangeOfString? Mind the fact that these are just short keys that are pointers to S3 objects.
Here is my last question, and the amount of objects in the database will vary on the decided answer, should I:
Create a separate key/object for every single object a user has
Create an account key/object and store all information inside there
There would obviously be different advantages and disadvantages between these two options. For example, it would be quicker to retrieve if everything is separate, but storing it all in one account object is more organized and keeps the dataset smaller.
So what do you think?
Thanks for the help! I have put a bounty on this, really need an answer ASAP.
Wow! What a Question :)
OK, let's discuss some aspects:
S3
S3 performance is most likely low because you're not adding a prefix when listing keys.
If you shard by storing the objects like type/owner/id, then listing all the ids for a given owner (prefixed as type/owner/) will be fast, or at least faster than listing everything at once.
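For instance, with a prefix per owner and type, listing becomes a bounded scan rather than a walk over the whole bucket (boto3 shown purely for illustration; it postdates this question, and the function name is invented):

import boto3

s3 = boto3.client('s3')

def keys_for(bucket, owner, type_):
    prefix = '%s:%s:' % (owner, type_)                 # e.g. 'Rohan:ProfilePic:'
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']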
Dynamo Versus SimpleDB
In general, this is my advice:
Use SimpleDB when:
Your entity storage isn't going to pass over 10GB
You need to apply complex queries involving multiple fields
Your queries aren't well defined
You can take advantage of multi-valued data types
Use DynamoDB when:
Your entity storage will pass 10GB
You want to scale demand / throughput as it goes
Your queries and model is well-defined, and unlikely to change.
Your model is dynamic, involving a loose schema
You can cache on your client-side your queries (so you can save on throughput by querying the cache prior to Dynamo)
You want to do aggregate/rollup summaries, by using Atomic Updates
Given your current description, it seems SimpleDB is actually better, since:
- Your model isn't completely defined
- You can defer some decision aspects, since it takes a while to hit the (10GiB) limits
Geographical SimpleDB
It isn't supported. It works only from us-east-1, AFAIK.
Key Naming
This applies most to Dynamo: Whenever you can, use Hash + Range Key. But you could also create keys using Hash, and apply some queries, like:
List all my records on table T whose key starts with accountid:
List all my records on table T whose key starts with accountid:image
However, those are still Scans. Bear that in mind.
(See this for an overview: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Scan.html)
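For illustration only (boto3, which postdates this answer; the table and attribute names are invented), a Hash + Range key lets you Query instead of Scan:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('T')          # hypothetical table name

# hash key = account, range key = the rest of the S3 key
response = table.query(
    KeyConditionExpression=Key('account').eq('Rohan') &
                           Key('objectkey').begins_with('ProfilePic:'))
keys = [item['objectkey'] for item in response['Items']]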
Bonus Track
If you're using Java, cloudy-data on Maven Central includes SimpleJPA with some extensions to Map Blob Fields to S3. So give it a look:
http://bitbucket.org/ingenieux/cloudy
Thank you

Relational vs Non-Relational Data Modeling - what's the difference

I'm new to databases and I've never worked with any RDBMS. However I get the basic idea of relational databases. At least I think I do ;-)
Let's say I have a user database with the following properties for each user:
user
id
name
zip
city
In a relational database I would for example model it in a table called user
user
id
name
location_id
and have a second table called location
location
id
zip
city
And location_id is a foreign key (a reference) to an entry in the location table. If I understand it right, the advantage here is that if the zip code for a certain city changes, I only have to change exactly one entry.
So, let's move on to the non-relational database, where I started to play around with Google App Engine. Here I would really model it just as it was first written down in the specification. I have a kind user:
class User(db.Model):
name = db.StringProperty()
zip = db.StringProperty()
city = db.StringProperty()
The advantage is that I don't need to join two "tables", but the disadvantage is that if the zip code changes I have to run a script that goes through all user entries and updates the zip code, correct?
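Something like this sketch, I suppose (written against the User kind above; the function name is just for illustration, and a real job would page with cursors or use the task queue):

from google.appengine.ext import db

def update_zip(old_zip, new_zip):
    # denormalized model: every matching User entity has to be rewritten
    users = User.all().filter('zip =', old_zip).fetch(1000)
    for user in users:
        user.zip = new_zip
    db.put(users)   # one batch write instead of a put() per entity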
So, now there is another option in Google App Engine, which is to use ReferenceProperties. I could have two kinds: user and location
class Location(db.Model):
zip = db.StringProperty()
city = db.StringProperty()
class User(db.Model):
name = db.StringProperty()
location = db.ReferenceProperty(Location)
If I'm not wrong, I now have exactly the same model as in the relational database described above. What I'm wondering now is, first of all, is what I just did wrong, and does it destroy all the advantages of a non-relational database? I understand that in order to get the values of zip and city I have to run a second query. But in the other case, to change a zip code I have to run through all existing users.
So what are the implications of these two modeling possibilities in a non-relational database like Google's Datastore? And what are typical use cases for both of them, meaning when should I use one and when the other?
Also, as an additional question: if in a non-relational database I can model exactly the same things I can model in a relational database, why should I use a relational database at all?
Sorry if some of these questions sound naive, but I'm sure they will help a couple people, who are new to database systems to get a better understanding.
In my experience, the biggest difference is that non-relational datastores force you to model based on how you'll query, because of the lack of joins, and how you'll write, because of the transaction restrictions. This of course results in very denormalized models. After a while, I started to define all the queries first, to avoid having to rethink the models later.
Because of the flexibility of relational DBs, you can think about each data family separately, create relations between them, and in the end query however you wish (abusing joins in many cases).
Imagine that GAE has two modes for the Datastore: RDBMS-mode and non-RDBMS-mode.
If I take your ReferenceProperty example with the aim of "list all the users and all their zip codes" and write some code to print all of these.
For the [fictional] RDBMS-mode Datastore it might look like:
for user in User.all().join("location"):
    print("name: %s zip: %s" % (user.name, user.location.zip))
Our RDBMS system has handled the de-normalisation of the data behind the scenes and done a nice job of returning all the data we needed in one query. This query did have a little bit of overhead, as it had to stitch together our two tables.
For the non-RDBMS Datastore our code might look like:
for user in User.all():
    location = Location.get( user.location )†
    print("name: %s zip: %s" % (user.name, location.zip))
Here the Datastore cannot help us join our data, and we must make an extra query for each and every user entity to fetch the location before we can print it.
This is in essence why you want to avoid overly normalised data on non-RDBMS systems.
Now, everybody logically normalizes their data to some extent whether they are using an RDBMS or not; the trick is to find the trade-off between convenience and performance for your use case.
† this is not valid App Engine code, I'm just illustrating that user.location would trigger a db query. Also, no one should write code like my extreme example above; you can work around the repeated fetching of related entities by, say, fetching locations in batches upfront, as in the sketch below.
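For example, a batched version of the loop above could look like this sketch, using ReferenceProperty.get_value_for_datastore to read the stored keys without triggering a get per user:

from google.appengine.ext import db

users = User.all().fetch(100)

# collect the raw keys without dereferencing (dereferencing would cost one get per user)
location_keys = [User.location.get_value_for_datastore(u) for u in users]
locations = db.get(location_keys)          # a single batch round trip

for user, location in zip(users, locations):
    print("name: %s zip: %s" % (user.name, location.zip))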
if in a non-relational database I can model exactly the same things I can model in a relational database, why should I use a relational database at all?
Relational DBs excel at storing thousands to millions of rows of complex, inter-related models of data, and at allowing you to perform incredibly intricate queries to reform and access that data.
Non-relational DBs excel at storing billions of rows of simple data and allowing you to fetch that data with simpler queries.
The choice should really lie with your use case. The simpler structure of the non-relational model, and the design restraints that come with it, are among the main ways App Engine is able to promise to scale your app with demand.
Your understanding of the concept of the relational database is flawed. Relational databases organize their data in relations which contain a set of tuples of the same type. To rephrase, data is stored in tables with each row containing the same number of fields with the same types in the same order.
The example you provided which utilizes a foreign key demonstrates database normalization. This is a concept that can apply to relational as well as other types of databases.
Sorry, I can't answer your questions about Google's storage system, but hopefully this will clarify your understanding enough to find out.

Are there ORMs (OKMs) for key-value stores?

Object-Relational-Mappers have been created to help applications (which think in terms of objects) deal with stored data in a more application-friendly way like every other class/object.
However, I have never seen an OKM (Object-Key/Value-Mapper) for NoSQL "key/value" storage systems. That seems odd, because the need should be far greater: more value relations have to be hard-coded into the app than with a regular, single SQL table row object.
four requests:
user:id
user:id:name
user:id:email
user:id:created
vs one request:
user = [id => ..., name => ..., email => ...]
Plus you must keep track of "lists" (post has_many comments) since you don't have has_many through tables or foreign keys.
INSERT INTO user_groups (user_id, group_id) VALUES (23, 54)
vs
usergroups:user_id = {54,108,32,..}
groupsuser:group_id = {23,12,645,..}
And there are lots more examples of the added logic an application would need to replicate basic features that normal relational databases provide. All of these reasons make the idea of an OKM sound like a shoo-in.
Are there any? Are there any reasons there are not any?
Ruby's DataMapper project is an ORM and will happily talk to a key-value store through the use of an adapter.
Redis and MongoDB have adapters that already exist. CouchDB has an adapter — it's not maintained, but at one point it worked pretty well. I don't think anyone's done anything with Cassandra yet, but there's no reason it couldn't be done. The Dubious framework for Google App Engine takes a very similar approach to Data Mapper to make the Data Store available to applications.
So it's very possible to do ORM with key-value stores. The ORM just really needs to avoid the assumption that SQL is its primary vocabulary.
One of the design goals of SQL is that any data can be stored/queried in any relational database - there are some differences between platforms, but in general the correct way to handle a particular data structure is well known and easily automated, though it requires fairly verbose code. That is not the case with NoSQL - generally you will be directly storing the data as used in your application rather than trying to map it to a relational structure, and without joins or other object/relational differences the mapping code is trivial.
Beyond generating the boilerplate data access code, one of the main purposes of an ORM is abstraction of differences between platforms. In my experience the ability to switch platforms has always been purely theoretical, and this lowest common denominator approach simply won't work for NoSQL as the platform is usually chosen specifically for capabilities not present on other platforms. Your example is only for the most trivial key value store - depending on your platform you most likely have some useful additional commands, so your first example could be
MGET user:id:name user:id:email ... (multiget - get any number of keys in a single call)
GET user:id:* (key wildcards)
HGETALL user:id (redis hash - gets all subkeys of user)
You might also have your user object stored in a serialized form - unlike in a relational database this will not break all your queries.
Working with lists isn't great if your platform doesn't have support built in - native list/set support is one of the reasons I like to use redis - but aside from potentially needing locks it's no worse than getting the list out of sql.
It's also worth noting that you may not need all the relationships you would define in SQL - for example, if you have a group containing a million users, the ability to get a list of all users in a group is completely useless, so you would never create the groupsuser list at all; rather than a separate usergroups list, have user:id:groups as a multivalue property. If you just need to check for membership you could set up keys as usergroups:userid:groupid and get constant-time lookup.
I find it helps to think in terms of indexes rather than relationships - when setting up your data access code, decide which fields will need to be queried and add appropriate index records when those fields are written.
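As a concrete sketch of that write-time index maintenance (redis-py here, with key names following the examples above; the client setup is an assumption):

import redis

r = redis.Redis()

def add_user_to_group(user_id, group_id):
    # maintain the "indexes" by hand at write time instead of relying on a join table
    r.sadd('user:%s:groups' % user_id, group_id)          # the groups a user belongs to
    r.set('usergroups:%s:%s' % (user_id, group_id), 1)    # membership flag for O(1) checks

def is_member(user_id, group_id):
    return r.exists('usergroups:%s:%s' % (user_id, group_id))

def groups_for(user_id):
    return r.smembers('user:%s:groups' % user_id)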
ORMs don't map terribly well to the schema-less nature of key-value stores. That being said, if you're using Riak and Ruby, you could take a look at Ripple. There are a number of other drivers for Riak which might fit with your language.
If you're looking into MongoDB (more of a document store than a k/v store), there are a number of drivers available.
The UniVerse DB, which is a descendant of Pick, lets you store a list of key-value pairs for a given key. However, this is very old technology and the world ran away from these databases a long time ago.
You can implement this in an SQL database with a three column table
CREATE TABLE ATTRS ( KEYVAL   VARCHAR(32),
                     ATTRNAME VARCHAR(32),
                     ATTRVAR  VARCHAR(1024)
                   )
Although most DBAs will hit you over the head with the very thick Codd and Date hardback edition if you propose this, it is in fact a very common pattern in packaged applications to allow you to add site specific attributes to a system.
To paraphrase Richard Stallman's comments on Lisp:
"Any reasonably functional data storage system will eventually end up implementing its own version of an RDBMS."
