Why maintain two-way pointers in data design? - database

Assume I have the following schema:
BOOKS COLLECTION:
{
  author: { ...authorObject }
}

AUTHORS COLLECTION:
{
  books: [{ ...bookObject }]
}
If I'm already storing the author's information on each book document, why should I store an array of books on each author document? Wouldn't it suffice to query from the books collection whenever I want to see all the books a particular author has written?
I found a similar schema here: https://www.howtographql.com/graphql-js/6-authentication/
Every link has a "postedBy" field, and each user has a "links" field. Why store the same information in both places? Isn't it inefficient? For instance, if one removes a link (or a book in the example above), you'd have to update the corresponding user's (or author's) document.
Just trying to understand why we need to store the same information in both directions. Feels a bit redundant.

I think it depends on your use cases. Say you only maintain a one-way pointer, keeping the books array on the author document: if you have a use case that requires finding the author of a given book, you will need to do a full collection scan of the authors collection to find all matches. So without more knowledge of your actual scenario, it is hard for us to comment on whether a two-way pointer is necessary in your database.
This document is a good read on the topic.
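To make the trade-off concrete, here is a rough sketch with pymongo; the collection names ("authors", "books") and the document shapes are assumptions based on the schema above, not a prescription:

from pymongo import MongoClient

db = MongoClient()["library"]

# One-way pointer only (the books array lives on the author document):
# finding the author of a given book means matching inside the embedded array,
# which is a full collection scan unless you index "books._id".
authors_of_book = list(db.authors.find({"books._id": "b001"}))

# Two-way pointer (each book also embeds its author):
# the same lookup becomes a direct query on the books collection.
book = db.books.find_one({"_id": "b001"})
author_id = book["author"]["_id"] if book else None

# The cost of the second pointer: removing a book now touches both collections.
db.books.delete_one({"_id": "b001"})
db.authors.update_one({"_id": author_id}, {"$pull": {"books": {"_id": "b001"}}})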

Related

One to one relationship vs one to many

I have two collections, movies and comments, and I want users to be able to post comments about a movie.
I can either have each movie contain an array of the ids of its comments, or have each comment contain the id of the movie it belongs to. What are the downsides and advantages of each method?
This is more of a theoretical question, so let's assume that the comments are too large to be embedded in the movies collection.
This question is difficult to answer. In a NoSQL DB (your "mongodb" tag indicates you are using one), the choice between two collections, a collection with the comments' _ids embedded in an array, or a single collection with the comment documents fully embedded really depends on your use cases.
With a SQL database you would create a movie table and a comment table, with the movie's id stored on each comment row.
With NoSQL, you have to choose based on your use cases: does your page display a list of movies first, with their associated comments? Do you have a page listing the latest comments regardless of movie? You also have to factor technical requirements and restrictions into your thinking. For example, MongoDB has one main restriction:
BSON Document Size - The maximum BSON document size is 16 megabytes.
The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API. See mongofiles and the documentation for your driver for more information about GridFS.
Check https://docs.mongodb.com/manual/reference/limits/ for more details.
My first thought, given your needs and my overall picture of what you want to do with your app, is based on the following use case:
A page lists all movies (you can eventually filter on various movie flags). So your entry point is a movie, not a comment. A comment relates to exactly one movie; a comment is never shared between movies.
For each movie, a user can display the associated comments and add a new comment.
For this use case, a performant database organisation is a single collection for movies, where each movie embeds its comments directly as an array of JSON objects, like:
{
  "_id": "m001",
  "title": "Movie1",
  "synopsis": "A young girl wants to learn chess and becomes the best player in the world; her name: Beth Harmon",
  "comments": [
    {
      "_id": "c001",
      "title": "Good movie",
      "commentText": "This is a very good movie"
    },
    {
      "_id": "c002",
      "title": "Annoying movie",
      "commentText": "This is a very annoying movie"
    }
  ]
}
You don't need to create another collection to store comments; doing so would cost you responsiveness, because displaying a movie would require joining it with a separate comment collection. BUT this is a good choice only if you expect each whole movie document to stay below 16MB (you could also use the GridFS API as indicated by the MongoDB docs, but that is not the subject here...).
Alternatively, IF you think millions and millions of comments, each with a lot of information, can be added to a single movie, you will hit that technical limit. In this case it is better to split into two collections, and the limit will no longer hurt you: each comment becomes its own document in the "comments" collection and will certainly never reach 16MB.
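As a rough illustration of that two-collection layout, here is a pymongo sketch; the collection and field names ("movies", "comments", "movie_id") are assumptions for the example, not part of the question:

from pymongo import MongoClient

db = MongoClient()["cinema"]

# Each comment is its own document pointing back at its movie,
# so a single movie can accumulate far more than 16MB of comments in total.
db.comments.insert_one({
    "_id": "c003",
    "movie_id": "m001",
    "title": "Good movie",
    "commentText": "This is a very good movie",
})

# Listing a movie's comments then takes a second query;
# an index on "movie_id" keeps it cheap.
db.comments.create_index("movie_id")
comments = list(db.comments.find({"movie_id": "m001"}))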
Finally, NoSQL DB performance can be far better than SQL DB performance, but you have to design your data model around your use cases.
I hope this is clear.
Useful links:
https://www.mongodb.com/basics/embedded-mongodb
https://fosterelli.co/collections-and-embedded-documents-in-mongodb (particularly "Example: comments on a blog" which seems to be your use case)

Adding additional data fields to account information in Substrate

Very new to Substrate and Rust. My understanding of the ChainState is that it acts sort of like a database which holds account numbers (in this case public keys) and their associated balances. When making a transaction, Substrate basically checks to see that you have a sufficient balance, and if so, the transaction succeeds. (This is different from the UTXO method used in Bitcoin.)
First of all, if I am wrong on the above, please correct me.
If I am correct (or at least close) I would like to find a method for associating other data with each account. I've noticed that in the demos, accounts are also associated with names, like Alice, Bob, etc. Is this kept in the ChainState, or is this something which would only be stored on one's own node?
I am trying to determine a way to associate additional data with accounts in the ChainState. For example, how could I store a name (like Alice, Bob, etc.) in the ChainState (assuming names are currently only stored locally), or other information such as the account owner's birthday, their favorite author, or any other arbitrary data?
The Chain State is just the state of everything, not necessarily connected to Account IDs. It does, among other things, store balances and such, yes, but also many other things that the chain stores in one way or another.
To add custom data, you would create a new structure (map) and then map account IDs to whatever data you want. As an example:
decl_storage! {
    trait Store for Module<T: Trait> as TemplateModule {
        /// The storage item for our proofs.
        /// It maps a proof to the user who made the claim and when they made it.
        Proofs: map hasher(blake2_128_concat) Vec<u8> => (T::AccountId, T::BlockNumber);
    }
}
The above declares a storage map which will associate a hash with a tuple of Account and Block Number. That way, querying the hash will return those two values. You could also do the reverse and associate an AccountID with some other value, like a string (Vec<u8>).
I recommend going through this tutorial from which I took the above snippet: it will show you exactly how to add custom information into a chain.
The answer given by @Swader was very good, as it was general in scope. I will be looking into this answer more as I try to associate more types of information. (I voted it up, but my vote isn't visible because I am relatively new to StackOverflow, at least on this account.)
After a bit more searching I also found this tutorial: Add a Pallet to Your Runtime.
This pallet happens to specifically add the ability to associate a nickname with the account ID, which was the example I gave in my question. @Swader's answer, however, was more general, and therefore both more useful and closer to answering my question.
By the way, the nicknames are saved as hex encoded, and are returned as hex encoded as well. An easy way to check that the hex encoding is actually equivalent to the nickname which was set is to visit https://convertstring.com/EncodeDecode/HexDecode and paste in the hex string, without the initial 0x.
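If you would rather check locally instead of using a website, a small Python 3 snippet does the same hex decode (the hex value below is just an illustrative example):

# Decode a hex-encoded nickname (drop the leading 0x first).
hex_value = "416c696365"                           # example only; yours will differ
print(bytes.fromhex(hex_value).decode("utf-8"))    # -> Alice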

Mark a post as read on appengine

I'm currently designing an application similar to twitter/jaiku/reddit in structure. Basically there are small posts with upvotes and downvotes, and they are sorted by score and time like reddit.
I've gotten all of this working, but now our requirements have changed a bit, and we need the user to be able to mark a post as 'read'. This would make the post no longer show up in that user's feed. I can model this with a Read entity for each (User, Post) tuple, but that would require a lot of work to find posts which do 'not' exist in that table. Alternatively, I can invert the relation so that I have one entity for each unread post, which makes it much easier to find which posts 'do' exist in the table... but then I'd need to create an entry in this table for every single user every time a post is made. This would not scale well.
My question is this: How would I model this sort of negative information in appengine's datastore? I'm using the go runtime if that matters, but answers for any runtime are fine.
This would be a many-to-many relationship. This article describes how to model different kinds of relationships, including many-to-many. The only issue is that I'm not sure whether you should store a list of read posts on the user, or a list of users who have read it on the post, as both lists might get large in different situations. If posts are relatively private and not seen by many people, you could store a list of user keys on the post model. But if one post could be seen by thousands of people, it might be better to store a list of posts on the users, as there will probably not be many users with thousands of read posts. Another option might be to discard old posts, or just discard their read state.
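For what it's worth, here is a rough sketch of the "list of read posts on the user" option using the Python datastore API (the question allows any runtime); the model and property names are invented for the example:

from google.appengine.ext import db

class Post(db.Model):
    body = db.TextProperty()
    score = db.IntegerProperty(default=0)

class UserProfile(db.Model):
    read_posts = db.ListProperty(db.Key)  # keys of posts this user marked as read

def mark_read(profile, post):
    if post.key() not in profile.read_posts:
        profile.read_posts.append(post.key())
        profile.put()

def unread_feed(profile, limit=50):
    # The datastore has no "NOT IN" filter, so over-fetch and drop read posts in memory.
    candidates = Post.all().order('-score').fetch(limit + len(profile.read_posts))
    read = set(profile.read_posts)
    return [p for p in candidates if p.key() not in read][:limit]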

What choices for a relational Document-store (NoSql?) database engine?

What choices are there for document-store databases that allow for relational data to be retrieved? To give a real example, say you have a database to store blog posts. I'd like to have the data look something like:
{
  id: 12345,
  title: "My post",
  body: "The body of my post",
  author: {
    id: 123,
    name: "Joe Bloggs",
    email: "joe.bloggs@example.com"
  }
}
Now, you will likely have a number of these records that all share the author details. What I'd really like is to have the author itself stored as a different record in the database, so that if you update this one record every post record that links to it gets the updates as well. To date the only way I've seen mentioned to do this is to have the post record instead store an ID of the author record, so that the calling code will have to make two queries of the data store - one for the post and another for the author ID that is linked to the post.
Are there any document-store databases that will allow me to make a single query and get back a structured document containing the linked records? And preferably allow me to edit an internal part of the document, persist the document as a whole, and have the correct thing happen (i.e., in the above, if I retrieved the entire document, changed the value of email, and persisted the entire document, then the email address of the author record is changed and reflected in all posts that have that author...)?
First, let me acknowledge that this particular type of data is somewhat relational by nature. It just depends on exactly how you want to structure it, and what technologies you have easy access to for this particular project. That said, how do you want your data structured?
If you can structure your data any way you want, you could go with something like this:
{
  name: 'Joe',
  email: 'joe.bloggs@ex.com',
  posts: [
    {
      id: 123,
      title: "My post"
    },
    {..}
  ]
}
Where all the posts are contained in one particular key/value pair. I would say this type of data is uniquely suited for Riak (due to its ability to query internally against JSON using JavaScript natively), though you could probably come at it from just about any of the NoSQL data stores (Cassandra, Couch, Mongo, et al.), as most of them can store straight-up JSON. I just have a tendency towards Riak at this point, due to my personal experience with it.
The more interesting problems you'll run up against will relate to how you work with the data store. For instance, I really like using Ripple for Ruby, which lets me deal with this kind of data in Riak very easily. But if you're in Java land, adopting this technique might be a bit more difficult (though I haven't spent a lot of time looking into Java adoption of Riak), since that ecosystem tends to lag on 'edge'-style data storage techniques.
More than that, getting your brain to start thinking in NoSQL terms, without 'relations', is usually what takes the longest when structuring data. Because there is no schema, and none of the preconceptions that come with one, you can do a lot of things that are considered simply wrong in the relational DB world, like storing all of the blog posts for a single user in one document, which just wouldn't work in the standard schema-heavy, strongly table-based relational world.

Does GAE Datastore support eager fetching?

Let's say I want to display a list of books and their authors. In traditional database design, I would issue a single query to retrieve rows from the Book table as well as the related Author table, a step known as eager fetching. This is done to avoid the dreaded N+1 select problem: If the Author records were retrieved lazily, my program would have to issue a separate query for each author, possibly as many queries as there are books in the list.
Does Google App Engine Datastore provide a similar mechanism, or is the N+1 select problem something that is no longer relevant on this platform?
I think you are implicitly asking if Google App Engine supports JOIN to avoid the N+1 select problem.
Google App Engine does not support JOIN directly, but it lets you define a one-to-many relationship using ReferenceProperty.
class Author(db.Model):
    name = db.StringProperty()

class Book(db.Model):
    title = db.StringProperty()
    author = db.ReferenceProperty(Author)
In your specific scenario, with two query calls, the first one to get the author:
author = Author.all().filter('name =', 'fooauthor').get()
and the second one to find all the books by that author:
books = Book.all().filter('author =', author).fetch(...)
you can get the same result as a typical SQL query that uses JOIN.
The N+1 problem would appear, for example, when we want to get 100 books, each with its author's name:
books = Book.all().fetch(100)
for book in books:
    print book.author.name
In this case, we need to execute 1+100 queries: one to get the list of books and 100 to dereference the author objects to get each author's name (this dereferencing happens implicitly in the book.author.name statement).
One common technique to work around this problem is to use the get_value_for_datastore method, which retrieves the referenced author's key for a given book without dereferencing it (i.e., without a datastore fetch):
author_key = Book.author.get_value_for_datastore(book)
There's a brilliant blog post on this topic that you might want to read.
This method, starting from the list of author keys, prefetches the author objects from the datastore and sets each one on the proper book entity.
Using this approach saves a lot of datastore calls and practically* avoids the N+1 problem.
* theoretically, for a bookshelf with 100 books written by 100 different authors, the datastore still has to retrieve 100+1 entities
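For completeness, here is a sketch of that prefetch helper along the lines of the approach described in the linked post, applied to the Author/Book models above (written as an illustration rather than a canonical implementation):

from google.appengine.ext import db

def prefetch_refprops(entities, *props):
    # Collect the referenced keys without dereferencing them (no datastore fetch yet).
    fields = [(entity, prop) for entity in entities for prop in props]
    ref_keys = [prop.get_value_for_datastore(x) for x, prop in fields]
    # One batch get for all distinct referenced entities.
    ref_entities = dict((e.key(), e) for e in db.get(list(set(ref_keys))))
    # Attach each fetched entity back onto its owner so later access is free.
    for (entity, prop), ref_key in zip(fields, ref_keys):
        prop.__set__(entity, ref_entities[ref_key])
    return entities

books = Book.all().fetch(100)
prefetch_refprops(books, Book.author)  # one query + one batch get
for book in books:
    print book.author.name             # no per-book datastore round trip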
Answering your question: Google App Engine does not support eager fetching, but there are techniques (not out of the box) that help avoid the dreaded N+1 problem.

Resources