My reading is so far limited, but here are some key points I have identified for using the GAE Datastore:
It is not a relational database.
Data duplication (denormalization) is expected; storage space is traded for read speed.
You cannot 'join' tables at the datastore level.
It is optimized for frequent reads with less frequent writes.
These lead me to the following data model for a Blog System:
Blogs have a relatively well-known set of 'columns': id, date, author, content, rating, tags. The Datastore allows additional columns to be added as desired, but adding columns on the fly should be rare, since it requires more specialized backend coding as well as more thought about the blog system as a whole.
What Blogs do not have is a set number of Comments and Tags. In a traditional relational db structure, these are mapped through a join. Since joins are not possible in GAE, I have thought about implementing the following:
Articles -> ID, Author, Date, Title, Content, Rating, Tags
Comments -> Article_ID, Author, Date, Content, Rating
Tags -> Tag, Article IDs
Example:
Article-
1 - Administrator - 01/01/2011 - Questions? - Answers… - 5 - questions, answers, speculations, ruminations
2 - Administrator - 01/05/2011 - Who knows? - Not me! - 10 - questions
Comments-
1 - John Smith - 01/02/2011 - Stupid, stupid, stupid.. - 0
1 - Jane Doe - 01/03/2011 - Smart, smart, smart.. - 5
Tags-
questions - 1, 2
answers - 1
speculations - 1
ruminations - 1
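In code, the model I have in mind might look something like this (a rough sketch with the db API; the property types are guesses on my part, not settled decisions):

from google.appengine.ext import db

class Article(db.Model):
    author = db.StringProperty()
    date = db.DateTimeProperty(auto_now_add=True)
    title = db.StringProperty()
    content = db.TextProperty()
    rating = db.IntegerProperty()
    tags = db.StringListProperty()      # duplicated here for display

class Comment(db.Model):
    article_id = db.IntegerProperty()   # ID of the parent article
    author = db.StringProperty()
    date = db.DateTimeProperty(auto_now_add=True)
    content = db.TextProperty()
    rating = db.IntegerProperty()

class Tag(db.Model):
    tag = db.StringProperty()
    article_ids = db.ListProperty(int)  # IDs of articles carrying this tag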
Now, this is my reasoning. When browsing a blog you do so by date, author, tag/topic, rating, comments, etc. Date, author, and rating are static, so they can easily reside in a single table along with the article in question.
Tags are duplicated between the tags 'table' and the article 'table'. Consistency here is handled at the application level, and the copy on the article is kept to eliminate an application-level join when sending articles to the viewer. The Tags table exists so that articles can be searched by tag: the matching list of article IDs is parsed at the application level, and those articles are then retrieved through a second application call.
The same thing is going to happen with the comments. The join will occur at the application level through an extra method call passing a retrieved article ID.
Now, why do I want to process a join at the application level? I had thought about embedding everything in each article, adding comments as they were created, but then I considered the time complexity of sorting and searching once a blog grew to thousands of articles, as well as the limitations on the size of returns, not knowing how large articles and comments might become. I haven't tested it, but thinking through the time complexity, I began to conclude that article retrieval would grow linearly with the number of articles when searching articles by tag. Am I correct in this, and is this approach a way to overcome that? Also, in general, does this data model look like a valid way to implement persistent data storage in GAE?
Thanks,
Trying to wrap my head around it...
Your approach sounds pretty reasonable. Retrieving articles by tag is most easily achieved by having a ListProperty of tags on the article, and filtering on that - which will take time proportional to the number of results returned, not to the number in the datastore - and you're right that you should keep a separate set of 'tag' entities so that you can list all the tags in use separately.
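For example (a minimal sketch; Article here stands in for your article kind, with tags as a StringListProperty):

from google.appengine.ext import db

class Article(db.Model):
    tags = db.StringListProperty()
    # other properties...

# A list property gets one index entry per value, so this matches every
# article whose tags list contains 'questions'; the cost is proportional
# to the number of results fetched, not to the total number of articles.
articles = Article.all().filter('tags =', 'questions').fetch(20)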
You may want to check out my series of posts on writing a blogging system on App Engine.
Related
I am implementing a Comments section for my current application. The Comments section can be thought of as a series of user posts on a given page. I am wondering which design would be most effective in a non-relational database (Google App Engine).
Design 1:
Group the comments by a groupId and filter on those results
Comment Entity >> [id, groupId, otherData...]
Queries for all comments pertaining to a page would look like:
Select from Comments filter by groupId
Design 2:
Store all comments within a group under a single key, and use a Self Expanding List if the number of entries exceeds 5000.
Comment Entity >> [id, SELid]
Queries would simply perform an id/key lookup.
I understand that indexes can be expensive, but the first design proposal will only index the groupId field and will only require a single write to post a comment (well, more writes if you include the index).
The second design will avoid costly indexing, but each posted comment will require a read and a write operation. Furthermore, I'm worried about contention issues. These comments should not experience extremely high throughput, but the second design seems to create a bottleneck.
As I am new to non-relational DB's, I would appreciate any input on these proposed designs and their associated tradeoffs.
In the case of App Engine and the Datastore, the approach you take depends mainly on the consistency model (strong vs eventual) you require for your entities. In Google Cloud Datastore, there is the concept of an entity group. An entity group (an entity and its descendants) is a unit with strong consistency, transactionality, and locality, but it also imposes some restrictions (1 write per second).
Considerations
Do you require strong consistent results?
How often will comments be posted per page?
How many comments per page do you expect?
Do you have a use case requiring transactional behaviour?
Since neither of your design options uses an entity group (page -> posts), I suppose you decided not to go this way.
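For contrast, a sketch of the entity-group approach (the kind and property names here are just placeholders):

from google.appengine.ext import ndb

class Comment(ndb.Model):
    author = ndb.StringProperty()
    body = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

page_key = ndb.Key('Page', 'page-42')

# The comment becomes a child of the page, i.e. part of its entity
# group, which is limited to roughly 1 write per second.
Comment(parent=page_key, author='ann', body='hi').put()

# Ancestor queries are strongly consistent (this combination of
# ancestor plus sort order may need a composite index).
comments = Comment.query(ancestor=page_key).order(Comment.created).fetch()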
Design 1
Eventually consistent lookup by groupId
Easier to maintain (you do not have to deal with the 5000-entry limit)
Design 2
Strongly consistent lookup by entityGroupId
Harder to maintain (you HAVE to deal with the 5000-entry limit)
As mentioned, one entity representing all posts for a page can be a bottleneck (this can be mitigated by means of Memcache)
I would probably go with the first approach, even though it can resemble a relational data model.
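A minimal sketch of Design 1 in ndb (property names are placeholders):

from google.appengine.ext import ndb

class Comment(ndb.Model):
    group_id = ndb.StringProperty()   # the page this comment belongs to
    author = ndb.StringProperty()
    body = ndb.TextProperty()         # TextProperty is never indexed
    created = ndb.DateTimeProperty(auto_now_add=True)

# Posting a comment writes a single independent entity
# (plus its index entries).
Comment(group_id='page-42', author='ann', body='hi').put()

# Eventually consistent lookup of all comments for a page; filtering on
# group_id while ordering by created requires a composite index.
comments = (Comment.query(Comment.group_id == 'page-42')
            .order(Comment.created)
            .fetch())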
Sorry if this question is too simple; I'm only entering 9th grade.
I'm trying to learn about NoSQL database design. I want to design a Google Datastore model that minimizes the number of read/writes.
Here is a toy example for a blog post and comments in a one-to-many relationship. Which is more efficient - storing all of the comments in a StructuredProperty or using a KeyProperty in the Comment model?
Again, the objective is to minimize the number of read/writes to the datastore. You may make the following assumptions:
Comments will not be retrieved independently of their respective blog post. (I suspect that this makes the StructuredProperty most preferable.)
Comments will need to be sortable by date, rating, author, etc. (Subproperties in the datastore cannot be indexed, so perhaps this could affect performance?)
Both blog posts and comments may be edited (or even deleted) after they are created.
Using StructuredProperty:
from google.appengine.ext import ndb

class Comment(ndb.Model):
    # various properties...
    pass

class BlogPost(ndb.Model):
    comments = ndb.StructuredProperty(Comment, repeated=True)
    # various other properties...
Using KeyProperty:
from google.appengine.ext import ndb

class BlogPost(ndb.Model):
    # various properties...
    pass

class Comment(ndb.Model):
    blogPost = ndb.KeyProperty(kind=BlogPost)
    # various other properties...
Feel free to bring up any other considerations that relate to efficiently representing a one-to-many relationship with regards to minimizing the number of read/writes to the datastore.
Thanks.
I could be wrong, but from what I understand, a StructuredProperty is just a property within an entity, but with sub-properties.
This means reading a BlogPost and all its comments would only cost one read. So when you render your page, you only need one read op for your entire page.
Each write would be cheaper too. You'll need one read op to get the BlogPost, and as long as you don't update any indexed properties, it'll be just one write op.
You can handle the comment sorting on your own after you read the entity out of the datastore.
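For instance (assuming the Comment model carries a date property):

post = blog_post_key.get()   # one read op for post plus comments
comments = sorted(post.comments, key=lambda c: c.date, reverse=True)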
You'll have to synchronize your comment updates/edits with transactions to make sure one comment doesn't overwrite another, since both modify the same entity. You may run into unsolvable contention problems if everyone is commenting on and editing the same blog post at the same time.
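A sketch of that synchronization, assuming the StructuredProperty model from the question:

from google.appengine.ext import ndb

@ndb.transactional
def add_comment(post_key, comment):
    # Read-modify-write inside a transaction: if two commenters hit the
    # same BlogPost at once, one transaction is retried instead of
    # silently overwriting the other's append.
    post = post_key.get()
    post.comments.append(comment)
    post.put()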
In optimizing for cost though, you'll hit a wall with the maximum entity size of 1MB. This will limit the number of comments you can store per blog post.
Going with the KeyProperty would be quite a bit more expensive.
You'll need one read to get the blog post, plus 1 query plus 1 small read op for each comment.
Every comment is a new entity, so it'll be at least 4 write ops. You may want to index for sort order, so that'll end up costing even more write ops.
On the plus side, you'll have unlimited comments per blog post, and you don't have to worry about synchronizing new comments. You might need to worry about synchronization when editing comments, but if you limit editing to the creator, that shouldn't really be a problem. You don't have to do the sorting yourself either.
It's a cost vs features tradeoff.
What about:
from google.appengine.ext import ndb

class Comment(ndb.Model):
    # various properties...
    pass

class BlogPost(ndb.Model):
    comments = ndb.KeyProperty(kind=Comment, repeated=True)
    # various other properties...
This way, you can store up to 5000 comments per blog post (the maximum number of values in a repeated property), independent of the size of each comment. You won't need a query to fetch the comments for a blog post; you can just do ndb.get_multi(blog_post.comments). And for this operation, you can try to rely on ndb's memcache. Of course, it depends on your use case whether this is a good assumption or not.
Be aware of this caveat when using a repeated StructuredProperty:
Do not use repeated properties if you have more than 100-1000 values. (1000 is probably already pushing it.) They weren't designed for such use.
See Guido's answer in GAE ndb design, performance and use of repeated properties.
So while you may not hit the 1 MB entity limit with StructuredProperty, you may easily hit the 100-1000 suggested max.
I need an efficient way to search through my models to find specific Users; here's a list of the models:
User - list of users, their names, etc.
Events - table of events for all users, recording when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I've got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average rating of 4 on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
Not sure if this method will solve your issue for all four cases, but it should at least help you out with the first one: querying user data efficiently.
I usually find the values or values_list query functions faster, because they slim down the SELECT part of the actual SQL, so you get results sooner. See the Django docs regarding this.
It's also worth mentioning that, starting with the new development version, you can traverse any type of relationship within values and values_list, including many-to-one.
Finally, you might also find in_bulk useful. For a complex query, you can fetch the ids of some models first using values or values_list and then use in_bulk to get the model instances faster. See the Django docs about that.
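For example (a rough sketch; the skills relation and the filter criteria are hypothetical stand-ins for your real schema):

# assuming: from yourapp.models import User

# Narrow the SELECT down to just the ids of matching users.
user_ids = (User.objects
            .filter(skills__name__in=['x', 'y', 'z'])
            .values_list('id', flat=True))

# Then batch-load the full model instances in a single query.
users_by_id = User.objects.in_bulk(list(user_ids))  # {id: User}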
Let's say I want to display a list of books and their authors. In traditional database design, I would issue a single query to retrieve rows from the Book table as well as the related Author table, a step known as eager fetching. This is done to avoid the dreaded N+1 select problem: If the Author records were retrieved lazily, my program would have to issue a separate query for each author, possibly as many queries as there are books in the list.
Does Google App Engine Datastore provide a similar mechanism, or is the N+1 select problem something that is no longer relevant on this platform?
I think you are implicitly asking if Google App Engine supports JOIN to avoid the N+1 select problem.
Google App Engine does not support JOIN directly, but it lets you define a one-to-many relationship using a ReferenceProperty.
from google.appengine.ext import db

class Author(db.Model):
    name = db.StringProperty()

class Book(db.Model):
    title = db.StringProperty()
    author = db.ReferenceProperty(Author)
In your specific scenario, with two query calls, the first one to get the author:

author = Author.all().filter('name =', 'fooauthor').get()

and the second one to find all the books by a given author:

books = Book.all().filter('author =', author).fetch(...)
you can get the same result of a common SQL Query that uses JOIN.
The N+1 problem can appear, for example, when we want to get 100 books, each with its author's name:

books = Book.all().fetch(100)
for book in books:
    print book.author.name
In this case, we need to execute 1+100 queries: one to get the books list and 100 to dereference each author object to get the author's name (the dereference happens implicitly on the book.author.name statement).
One common technique to work around this problem is the get_value_for_datastore method, which retrieves the referenced author's key for a given book without dereferencing it (dereferencing is what costs a datastore fetch):
author_key = Book.author.get_value_for_datastore(book)
There's a brilliant blog post on this topic that you might want to read.
This method, starting from the author_key list, prefetches the author objects from the datastore and sets each one on the proper book entity.
Using this approach saves a lot of calls to the datastore and practically* avoids the N+1 problem.
* theoretically, on a bookshelf with 100 books written by 100 different authors, we still have to fetch 100+1 entities from the datastore
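The core of that pattern looks roughly like this (a sketch, not the blog post's exact code; prefetch_authors is a made-up name):

from google.appengine.ext import db

def prefetch_authors(books):
    # Collect the referenced Author keys without dereferencing them;
    # get_value_for_datastore does not touch the datastore.
    author_keys = [Book.author.get_value_for_datastore(b) for b in books]
    unique = list(set(k for k in author_keys if k is not None))
    # One batch get for all distinct authors instead of N single gets.
    authors = dict((a.key(), a) for a in db.get(unique) if a is not None)
    # Attach each author back onto its book entity.
    for book, key in zip(books, author_keys):
        book.author = authors.get(key)
    return books

books = prefetch_authors(Book.all().fetch(100))  # 1 query + 1 batch get
for book in books:
    print book.author.name                       # no extra datastore calls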
Answering your question:

Google App Engine does not support eager fetching.

There are techniques (not out of the box) that help to avoid the dreaded N+1 problem.
I'd like to work on a project, but it's a little odd. I want to create a site that shows lyrics and their translations, but they are shown simultaneously side-by-side (so this isn't just a normal i18n of the site).
I have normalized the tables like this (formatted to show hierarchy).
artists
artistNames
albums
albumNames
tracks
trackNames
trackLyrics
user
So questions,
First, that'll be a whopping seven joins. I must have written pretty small queries in the past because I've never come across something like this. Is joining so many tables a bad thing? I'm pretty sure I'll be using SQLite for this project, but does anyone think PostgreSQL or MySQL could perform better with a pretty big join like this?
Second, my current self-built framework uses a data mapper to create domain objects. This is the first time I will be working with so many one-to-many relationships, so my mapper really only takes one row as one object. For example,
id name
------ ----------
1 Jackie Chan
2 Stephen Chow
So it's super easy to map objects. But with those one to many relationships...
id language name
------ ---------- -------
1 en Jackie Chan
1 zh 陳港生
2 en Stephen Chow
2 zh 周星馳
...I'm not sure what to do. Is looping through the result set to create a massive array and feeding it to my domain object factory the only option when dealing with a data set like this?
<?php
array(
array(
'id' => 1,
'names' => array(
'en' => 'Jackie Chan',
'zh' => '陳港生'
)
),
array(
'id' => 2,
'names' => array(
'en' => 'Stephen Chow',
'zh' => '周星馳'
)
)
);
?>
I have an itch to just denormalize these tables so I can get my one row per object application working, but I've always read this is not the way to go.
Third, does this schema sound right for the job?
Twelve-way joins are not unheard of in serious industrial work. You need sufficient hardware, a strong DBMS, and good database design. A seven-way join should be a breeze for any good environment.
You separate out data, as needed, to avoid difficulties like database update anomalies. These anomalies are what you get when you don't follow the normalization rules. You join data as needed to get the data that you need in a single result.
Sometimes it's better to ignore some of the normalization rules when you build a database. In that case, you need an alternative set of design principles in order to avoid design by trial and error. The amount of joining you are doing has little to do with the disadvantages of looping through results or unfortunate mapping between tuples and objects.
Most of the mappings between tuples (table rows) and objects are done in an incorrect fashion. A tuple is an object, but it isn't an application-oriented object. This can cause performance problems, difficult programming, or both.
As far as you can avoid it, don't loop through results, one row at a time. Deal with results as a set of data. If you can't do that in PHP, then you need to learn how, or get a better programming environment.
Just a note. I'm not really sure that 7 tables is that big a join. I seem to remember that Postgres has a special query optimiser (based on a genetic algorithm, no less) that only kicks in once you join 12 tables or more.
The general rule is to make the schema as normalized as possible. Then perform stress tests with the expected amount of data. If you find performance bottlenecks, you should try to optimize in the following order:
1. Profile and optimize queries
2. Add indices to the schema
3. Add hints to the query optimizer (I don't know if SQLite has any, but most databases do)
If steps 1.-3. do not gain any performance benefits, consider denormalizing the database.
Denormalizing a database is usually needed only if you work with "large" amounts of data. I checked several lyrics databases on the internet, and the largest I found had lyrics for about 400,000 songs. Let's assume you can find 1,000,000 lyrics performed by 500,000 artists. That is an amount of data that any of these databases can easily handle on an average modern computer.
Doing this many joins shouldn't be an issue on any serious DB. I haven't worked with SQLite enough to know if it's in the "serious" category. The only way to find out would be to create your schema, load up a lot of data, and start looking at query plans (visual explains are very useful here). When I do these kinds of tests, I usually shoot for 10x the data I expect to have in production. If things work OK with this much data, I know I should be OK with real data.
Also, depending on how you need to retrieve the data, you may want to try subqueries instead of joins:
select a.*, (select r.name from artist r where r.id = a.artist and r.locale = 'en') as artist_name from album a where a.id = 1;
I've helped a friend optimize a web storefront. In your case, it's a lot the same.
First. What is your priority, webpage speed or update speed?
Normal forms were designed to make data maintenance simple. If Prince changes his name again, voila, just one row is updated. But if you want your web pages to render as fast as possible, then third normal form isn't your best plan. Yes, everyone is correct that a 7-way join is no problem, but it will cost dozens of I/Os: an index lookup on every table, then a table access by rowid, then again and again. If you denormalize for webpage loading speed, you may do 2 or 3 I/Os. That also allows for greater scaling: since every page hit needs fewer I/Os to complete, you'll be able to serve more simultaneous hits before maxing out your I/O.
But there's no reason not to do both. You can keep the base data, the official copy, in normal form, then write a script that generates a denormalized table for web performance. If it's not that big, you can regenerate the whole thing in a few minutes of maintenance downtime. If it is very big, you may need to be smart about the update and only change what needs changing, keeping change vectors in an intermediate driving table.
But at the heart of your design I have a question.
Artist names change over time. John Cougar became John Cougar Melonhead (or something) and then later he became John Mellencamp. Do you care which John did a song? Will you stamp the entries with from and to valid dates?
It looks like you have a 1-n relationship from artists to albums, but that really should be many-to-many.
Sometimes the same album is released more than once, with different included tracks and sometimes with different names for a track. Think international releases. Or bonus tracks. How will you know that's all the same album?
If you don't care about those details then why bother with normalization? If Jon and Vangelis is 1 artist, then there is simply no need to normalize. You're not interested in the answers normalization will provide.