Let's say I want to display a list of books and their authors. In traditional database design, I would issue a single query to retrieve rows from the Book table as well as the related Author table, a step known as eager fetching. This is done to avoid the dreaded N+1 select problem: If the Author records were retrieved lazily, my program would have to issue a separate query for each author, possibly as many queries as there are books in the list.
Does Google App Engine Datastore provide a similar mechanism, or is the N+1 select problem something that is no longer relevant on this platform?
I think you are implicitly asking if Google App Engine supports JOIN to avoid the N+1 select problem.
Google App Engine does not support JOIN directly but lets you define a one to many relationship using ReferenceProperty.
from google.appengine.ext import db

class Author(db.Model):
    name = db.StringProperty()

class Book(db.Model):
    title = db.StringProperty()
    author = db.ReferenceProperty(Author)
In your specific scenario, two query calls, the first one to get the author:
author = Author.all().filter('name =', 'fooauthor').get()
and the second one to find all the books by that author:
books = Book.all().filter('author =', author).fetch(...)
give you the same result as a typical SQL query that uses JOIN.
The N+1 problem could for example appear when we want to get 100 books, each with its author name:
books = Book.all().fetch(100)
for book in books:
    print book.author.name
In this case, we need to execute 1+100 queries: one to get the list of books and 100 to dereference the author objects and read each author's name (this step happens implicitly in the book.author.name expression).
One common technique to work around this problem is to use the get_value_for_datastore method, which retrieves the key of a book's referenced author without dereferencing it (i.e., without a datastore fetch):
author_key = Book.author.get_value_for_datastore(book)
There's a brilliant blog post on this topic that you might want to read.
The method described there starts from the list of author keys, prefetches the author entities from the datastore, and sets each one on the corresponding book entity.
Using this approach saves a lot of calls to the datastore and practically * avoids the N+1 problem.
* theoretically, on a bookshelf with 100 books written by 100 different authors, we still have to fetch 100 distinct author entities on top of the query for the books
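To make the difference concrete, here is a minimal sketch of the two access patterns in plain Python 3. It deliberately does not use the App Engine API: FakeDatastore, get, get_multi, and the "author_key" field are hypothetical stand-ins that only count how many calls and entity reads each pattern causes. The query that fetches the books themselves is not counted.

```python
class FakeDatastore:
    """Hypothetical in-memory stand-in for a datastore; it only counts accesses."""

    def __init__(self, entities):
        self._entities = dict(entities)  # key -> entity
        self.calls = 0                   # number of separate calls made
        self.entities_read = 0           # number of entities returned in total

    def get(self, key):
        """Fetch a single entity: one call, one entity read."""
        self.calls += 1
        self.entities_read += 1
        return self._entities[key]

    def get_multi(self, keys):
        """Fetch several entities in one call."""
        keys = list(keys)
        self.calls += 1
        self.entities_read += len(keys)
        return [self._entities[k] for k in keys]


def author_names_lazy(store, books):
    """N+1 pattern: dereference the author of every book separately."""
    return [store.get(book["author_key"])["name"] for book in books]


def author_names_prefetched(store, books):
    """Collect the distinct author keys, fetch them once, then map them back."""
    keys = sorted({book["author_key"] for book in books})
    authors = dict(zip(keys, store.get_multi(keys)))
    return [authors[book["author_key"]]["name"] for book in books]


# 100 books written by 10 distinct authors.
authors = {"a%d" % i: {"name": "Author %d" % i} for i in range(10)}
books = [{"title": "Book %d" % i, "author_key": "a%d" % (i % 10)} for i in range(100)]

lazy_store = FakeDatastore(authors)
lazy_names = author_names_lazy(lazy_store, books)

batch_store = FakeDatastore(authors)
batch_names = author_names_prefetched(batch_store, books)

print(lazy_store.calls, lazy_store.entities_read)    # 100 100
print(batch_store.calls, batch_store.entities_read)  # 1 10
```

Both functions return the same list of names. The saving comes from deduplicating the keys before fetching, so the fewer distinct authors there are, the bigger the gain.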
Answering your question:
Google App Engine does not support eager fetching.
There are techniques (not available out of the box) that help avoid the dreaded N+1 problem.
I have two collections, a movies collection and a comments collection, and I want users to be able to post comments about a movie.
I can either have each movie contain an array with the IDs of its comments, or have each comment contain the ID of the movie it belongs to. What are the advantages and downsides of each method?
This is more of a theoretical question, so let's assume that comments are too large to be embedded in the movies collection.
This question is difficult to answer. In a NoSQL database (your "mongodb" tag indicates that is what you are using), the choice between two collections, one collection with an array of comment _ids embedded in each movie, or one single collection with the full comments embedded really depends on your use cases.
With an SQL database, you can create a movie table and a comment table, with the movie's ID stored in each comment row.
With NoSQL, you have to choose according to your use cases: does your page display a list of movies first, with the associated comments? Do you have a page that lists the latest comments regardless of movie? You also have to factor technical requirements and restrictions into your thinking. For example, MongoDB has one main restriction:
BSON Document Size - The maximum BSON document size is 16 megabytes.
The maximum document size helps ensure that a single document cannot
use excessive amount of RAM or, during transmission, excessive amount
of bandwidth. To store documents larger than the maximum size, MongoDB
provides the GridFS API. See mongofiles and the documentation for your
driver for more information about GridFS.
Check https://docs.mongodb.com/manual/reference/limits/ for more details.
My first thought, based on my general picture of what you want to do with your app, is the following use case:
A page lists all movies (possibly filtered on various movie attributes). So your entry point is a movie, not a comment. A comment relates to exactly one movie, never to more than one.
For each movie, a user can display the associated comments and add a new comment.
For this use case, an efficient database layout is one single collection for movies, where each movie embeds its comments directly as an array of JSON objects, like:
{
  "_id": "m001",
  "title": "Movie1",
  "synopsis": "A young girl wants to learn chess and becomes the best player in the world; her name: Beth Harmon",
  "comments": [
    {
      "_id": "c001",
      "title": "Good movie",
      "commentText": "This is a very good movie"
    },
    {
      "_id": "c002",
      "title": "Annoying movie",
      "commentText": "This is a very annoying movie"
    }
  ]
}
You don't need to create another collection to store comments; you would lose responsiveness because of having to join from movie to a separate comment collection. BUT this is a good choice only if you expect that no movie document, including all its comments, will ever exceed 16 MB (you can also use the GridFS API as indicated by the MongoDB docs, but that is not the subject here...).
Alternatively, IF you think millions and millions of comments, each with a lot of information, can be added to a single movie, you will hit that technical limit. In this case, it is better to split the data into two collections, so that the limit will not hurt you: each comment will be a document in the "comment" collection and will certainly not reach 16 MB.
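The two layouts can be compared side by side with plain Python dictionaries. This is only a sketch, not MongoDB driver code: the "movie_id" field name and the helper functions are assumptions made for illustration, and the size estimate uses the JSON encoding as a rough proxy, so the real BSON size will differ.

```python
import json

# Layout A: comments embedded in the movie document.
embedded_movie = {
    "_id": "m001",
    "title": "Movie1",
    "comments": [
        {"_id": "c001", "title": "Good movie",
         "commentText": "This is a very good movie"},
        {"_id": "c002", "title": "Annoying movie",
         "commentText": "This is a very annoying movie"},
    ],
}

# Layout B: comments live in their own collection and point at their movie.
# "movie_id" is a hypothetical field name.
movies = [{"_id": "m001", "title": "Movie1"}]
comments = [
    {"_id": "c001", "movie_id": "m001", "title": "Good movie",
     "commentText": "This is a very good movie"},
    {"_id": "c002", "movie_id": "m001", "title": "Annoying movie",
     "commentText": "This is a very annoying movie"},
]


def comments_for(movie_id, comment_docs):
    """The extra lookup that layout B needs to show a movie's comments."""
    return [c for c in comment_docs if c["movie_id"] == movie_id]


def approx_size(doc):
    """Rough size in bytes via JSON encoding; the real BSON size will differ."""
    return len(json.dumps(doc).encode("utf-8"))


# Layout A reads everything in one document; layout B needs a second lookup.
print(len(embedded_movie["comments"]), len(comments_for("m001", comments)))

# In layout A the movie document grows with every comment; in layout B it does not.
print(approx_size(embedded_movie) > approx_size(movies[0]))
```

In layout A, every new comment pushes the movie document closer to the size limit. In layout B, only the comments collection grows, at the cost of the second lookup.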
Finally, NoSQL database performance can be much better than SQL database performance, but you have to design your data model around your use case.
I hope this is clear.
Useful links :
https://www.mongodb.com/basics/embedded-mongodb
https://fosterelli.co/collections-and-embedded-documents-in-mongodb (particularly "Example: comments on a blog" which seems to be your use case)
I have run into a scenario while running a query in App Engine that is increasing my cost considerably.
I am using the query below to fetch book names:
Iterable<Entity> entities =
datastore.prepare(query).asIterable(DEFAULT_FETCH_OPTIONS);
After that I run a loop to match each name against the name the user has requested. This reads every book in the datastore, and with the number of books growing day by day, the cost keeps rising because the entire list is read each time.
Is there an alternative that fetches only the book the user requested, so that I don't have to read the complete datastore? Will SQL help, or filters? I would appreciate it if someone could provide the query.
You have two options:
If you match the title exactly, make it an indexed field and use a filter to fetch only books with exactly the same title.
If you search within titles too:
a. You can use Search API to index all titles and use it to find the books your users are looking for.
b. A less optimal but quick solution is to create a projection query that reads only the book titles.
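The first option can be illustrated without the datastore API. The sketch below, in plain Python 3, contrasts reading every book and comparing titles in application code with looking the title up in an index. The dictionary index and the read counters are stand-ins for what an indexed property and a filter would do on the server; they are assumptions for illustration, not App Engine calls.

```python
from collections import defaultdict

books = [{"id": i, "title": "Book %d" % i} for i in range(1000)]


def find_by_scan(all_books, title):
    """Read every book and compare in application code (the costly pattern)."""
    reads = 0
    matches = []
    for book in all_books:
        reads += 1
        if book["title"] == title:
            matches.append(book)
    return matches, reads


def build_title_index(all_books):
    """Stand-in for an indexed title property: title -> matching books."""
    index = defaultdict(list)
    for book in all_books:
        index[book["title"]].append(book)
    return index


def find_by_index(index, title):
    """Exact-match lookup: only the matching books are read."""
    matches = index.get(title, [])
    return matches, len(matches)


title_index = build_title_index(books)
scan_matches, scan_reads = find_by_scan(books, "Book 42")
index_matches, index_reads = find_by_index(title_index, "Book 42")

print(scan_reads, index_reads)  # 1000 1
```

Both lookups return the same book, but the scan touches all 1000 records while the index lookup touches one.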
Well, I have this line of code in the tutorial I am following. However, it did not give me a clear explanation of recursive. I am a newbie in CakePHP and have searched for this "recursive". I hope somebody can provide a layman's explanation of this code:
$this->Author->recursive = 1;
Thank you
The first result on Google is a clear explanation from the CakePHP reference itself:
http://book.cakephp.org/2.0/en/models/model-attributes.html#recursive
It sets the depth to which records associated with a model's data are retrieved, so that you can limit how much data a query fetches when there are many levels of associations between your models.
I would recommend that you check the documentation first.
Recursive defines the amount of data that will be fetched from the database. By default, CakePHP gets the data of the model/table that you're querying and the data of the models/tables that are linked to the main model/table (hasMany, belongsTo, etc.).
By setting recursive, you're telling CakePHP to fetch only a certain amount of data; it can be more or less, depending on how deep the associations between the models/tables are and on the number specified in recursive.
Setting recursive to -1 will get only the data of the model that you're querying; setting it higher will ask CakePHP to fetch deeper associations.
Let's say that in our app we have authors who sell books, and readers comment on those books.
Author 1 <> * Book 1 <> * Comment
If we don't set recursive while fetching the list of authors, CakePHP uses the default level of 1 and gets the authors together with their books; at level 2 it would also get the comments on each book.
$authors = $this->Author->find('all');
The problem is that for each display of the list, CakePHP and the database deal with a lot of unnecessary data, which in turn hurts the performance of your HTTP and database servers.
Imagine that the list is shown 10 times per second and each list shows 20 authors (an author can have anywhere from one book upward; let's say 10 books on average for this example, with 5 comments each). Do the math and you will see that the servers are processing a lot of data that won't be used in the end.
The user only wants to see the list of authors, so there's no need to fetch all the books and comments unless you're going to process them in the controller or display them in the views. We can avoid that by setting recursive to -1.
$this->Author->recursive = -1;
$authors = $this->Author->find('all');
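To put numbers on this, here is a small Python 3 sketch of depth-limited fetching using the figures from the example (20 authors, 10 books each, 5 comments per book). It is a simplified model, not CakePHP's implementation: find_all and count_records are hypothetical helpers, and the levels are reduced to "-1 means authors only, 1 adds books, 2 adds comments".

```python
# Sample data: 20 authors, each with 10 books, each book with 5 comments.
AUTHORS = [
    {
        "name": "Author %d" % a,
        "books": [
            {
                "title": "Book %d.%d" % (a, b),
                "comments": [{"text": "Comment %d" % c} for c in range(5)],
            }
            for b in range(10)
        ],
    }
    for a in range(20)
]


def find_all(authors, recursive):
    """Simplified model of a depth-limited fetch (hypothetical helper)."""
    rows = []
    for author in authors:
        row = {"name": author["name"]}
        if recursive >= 1:
            row["books"] = []
            for book in author["books"]:
                book_row = {"title": book["title"]}
                if recursive >= 2:
                    book_row["comments"] = [dict(c) for c in book["comments"]]
                row["books"].append(book_row)
        rows.append(row)
    return rows


def count_records(rows):
    """Count authors, books, and comments contained in a result set."""
    total = 0
    for row in rows:
        total += 1
        for book in row.get("books", []):
            total += 1 + len(book.get("comments", []))
    return total


for level in (-1, 1, 2):
    per_list = count_records(find_all(AUTHORS, level))
    # At 10 list displays per second, multiply by 10.
    print(level, per_list, per_list * 10)
# -1 20 200
# 1 220 2200
# 2 1220 12200
```

In this model, showing the list 10 times per second means 200 records per second at level -1, against 2,200 at level 1 and 12,200 at level 2, even though the page only displays the authors.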
You may also want to optimize your queries so that they fetch only the fields you're going to use; it will boost overall performance, but that's another subject.
Sometimes you will find yourself wanting to do the reverse. Let's say that the app updates the Auth Session variable whenever the user logs in (IP, browser info, OAuth token, group info, etc.) and that the app uses all the user's related info to adapt the user experience. For example, if the user belongs to a certain group, it shows info and options relevant to that group; if the user has allowed the app to access certain account info from a third-party provider (Google?), it shows services that use that kind of data, say a Google+ feed or something similar.
It would be a lot easier to fetch all the user's related info once they have logged in and store it in the Session, which the views can then use to adapt the user experience. One way of doing so would be to fetch the related data piece by piece and store it in the Session; another is simply to set recursive to 2 and store the result in the Session, which fetches all the data related to the user model.
Old response
recursive allows you to define the amount of data to get from the database. Let's say that an Author has many publications.
If you specify -1 for recursive before getting a certain author from the database, like so:
$this->Author->recursive = -1;
$author = $this->Author->findByName('Someone');
you will get only the author information, that is, data from the authors table and none from related tables such as publications.
You can see this for yourself by using this code:
//only author info
$this->Author->recursive = -1;
$author = $this->Author->findByName('Someone');
//display the result
debug($author);
//get the author and related publications info
$this->Author->recursive = 1;
$authorAndPublications = $this->Author->findByName('Someone');
//display result
debug($authorAndPublications);
exit;
The recursive property thus specifies how much information you want from your database.
Where should I use it?
Let's suppose each author has at least 10 publications and you want to query the database to find the authors. If you don't set the recursive property, CakePHP will get all the authors and their publications too! So let's say 50 authors * 10 publications... you get the picture: you are querying for a ton of unnecessary data.
It matters a lot on a high-traffic site, since at each display of the authors list you query for 500 unnecessary publication records (that won't be used) just to display some information about the 50 authors in a list/table.
By setting recursive = -1 before querying for the authors, you ease the strain on the database, which results in better responsiveness and performance.
From the documentation v1.3, v2.0:
The recursive property defines how deep CakePHP should go to fetch associated model data via find(), findAll() and read() methods.
Imagine your application features Groups which belong to a domain and have many Users which in turn have many Articles. You can set $recursive to different values based on the amount of data you want back from a $this->Group->find() call:
...documentation of the levels omitted...
Set it no higher than you need. Having CakePHP fetch data you aren’t going to use slows your app unnecessarily. Also note that the default recursive level is 1.
I need an efficient way to search through my models to find a specific User, here's a list,
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
Not sure if this method will solve your issue in all 4 cases, but it should at least help with the first one: querying user data efficiently.
I usually find the values or values_list query functions faster because they slim down the SELECT part of the actual SQL, so you get results faster. Django docs regarding this.
It is also worth mentioning that, starting with the new dev version, values and values_list let you query across any type of relationship, including many-to-one.
Finally, you might find in_bulk useful as well. For a complex query, you might try querying just the IDs of some models first using values or values_list, and then use in_bulk to get the model instances faster. Django docs about that.
My reading is limited as of yet, but so far here are some key points I have identified for using the GAE Datastore:
It is not a relational database.
Data duplication occurs by default across storage space.
You cannot 'join' tables at the datastore level.
Optimized for reads with less frequent writes.
These lead me to the following data model for a Blog System:
Blogs have a relatively known set of 'columns': id, date, author, content, rating, tags. The Datastore allows for additional columns as desired, but it is known that the likelihood of adding additional columns on the fly will be rare as it requires more backend specialized coding as well as more thought to the entire blog system.
What Blogs do not have are a set number of Comments and Tags. In a traditional relational db structure, these are mapped through a join. Since these are not possible in GAE, I have thought about implementing the following:
Articles -> ID, Author, Date, Title, Content, Rating, Tags
Comments -> Article_ID, Author, Date, Content, Rating
Tags -> Tag, Article IDs
Example:
Article-
1 - Administrator - 01/01/2011 - Questions? - Answers… - 5 - questions, answers, speculations, ruminations
2 - Administrator - 01/05/2011 - Who knows? - Not me! - 10 - questions
Comments-
1 - John Smith - 01/02/2011 - Stupid, stupid, stupid.. - 0
1 - Jane Doe - 01/03/2011 - Smart, smart, smart.. - 5
Tags-
questions - 1, 2
answers - 1
speculations - 1
ruminations - 1
Now, this is my reasoning. When browsing a Blog you do so by: Date, Author, Tag / Topic, Rating, Comments, etc. Date, Author, and Rating are static and so can easily reside in a single table along with the article in question.
Tags are duplicated between the tags 'table' and the article 'table', but consistency here is handled at the application level and the tags are left in to eliminate a join at the application level when sending articles to the viewer. The Tags table is used in order to search by tag. The list of articles is then parsed at application level and then it retrieves these articles through an application call.
The same thing is going to happen with the comments. The join will occur at the application level through an extra method call passing a retrieved article ID.
Now, why do I want to process a join at the application level? I had thought about inserting everything into each article, adding comments as they were created, but got to thinking about the time complexity of sorting and searching once a blog was into the thousands of articles, as well as the limitations on size of returns, not knowing how large articles / comments might become. I haven't tested, but in thinking of the time complexity, I began to conclude that article retrieval would grow linearly to number of articles when attempting to search these articles by tags. Am I correct in this and is this approach a way to overcome that? Also, in general does this data model look like a way to validly implement persistent data storage in GAE?
Thanks,
Trying to wrap my head around it...
Your approach sounds pretty reasonable. Retrieving articles by tag is most easily achieved by having a ListProperty of tags on the article, and filtering on that - which will take time proportional to the number of results returned, not to the number in the datastore - and you're right that you should keep a separate set of 'tag' entities so that you can list all the tags in use separately.
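The idea of filtering on a list of tags can be pictured as a multi-valued index in which each tag on an article produces its own index entry. The Python 3 sketch below models that with a dictionary, using the two sample articles from the question. It is an illustration of the concept only: build_tag_index and articles_with_tag are hypothetical helpers, not datastore calls.

```python
from collections import defaultdict

# The two sample articles from the question, keyed by article ID.
articles = {
    1: {"title": "Questions?",
        "tags": ["questions", "answers", "speculations", "ruminations"]},
    2: {"title": "Who knows?", "tags": ["questions"]},
}


def build_tag_index(all_articles):
    """One index entry per (tag, article) pair, as with a multi-valued property."""
    index = defaultdict(list)
    for article_id, article in sorted(all_articles.items()):
        for tag in article["tags"]:
            index[tag].append(article_id)
    return index


def articles_with_tag(index, all_articles, tag):
    """Visit only the matching index entries, not every article."""
    return [all_articles[i]["title"] for i in index.get(tag, [])]


tag_index = build_tag_index(articles)

print(articles_with_tag(tag_index, articles, "questions"))  # ['Questions?', 'Who knows?']
print(articles_with_tag(tag_index, articles, "answers"))    # ['Questions?']
print(sorted(tag_index))  # ['answers', 'questions', 'ruminations', 'speculations']
```

The lookup only walks the entries for the requested tag, so its cost follows the number of matching articles rather than the total number stored. Listing every tag in use is the job the separate set of tag entities takes over in the real design.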
You may want to check out my series of posts on writing a blogging system on App Engine.