Data storage: "grouping" entities by property value? (like a dictionary/map?)

Data storage: "grouping" entities by property value? (like a dictionary/map?) - database

Using AppEngine datastore, but this might be agnostic, no idea.
Assume a database entity called Comment. Each Comment belongs to a User. Every Comment has a date property, pretty standard so far.
I want something that will let me: specify a User and get back a dictionary-ish (coming from a Python background, pardon. Hash table, map, however it should be called in this context) data structure where:
keys: every date appearing in the User's comment
values: Comments that were made on date.
I guess I could just iterate over a range of dates an build a map like this myself, but I seriously doubt I need to "invent" my own solution here.
Is there a way/tool/technique to do this?

Datastore supports both references and list properties. This let's you build one-to-many relationships in two ways:
Parent (User) has a list property containing keys of Child entities (Comment).
Child has a key property pointing to Parent.
Since you need to limit Comments by date, you'd best go with option two. Then you could query Comments which have date=somedate (or date range) and where user=someuserkey.
There is no native grouping functionality in Datastore, so to also "group" by date, you can add a sort on date to the query. Than when you iterate over the result, when the date changes you can use/store it as a grouping key.
Update
Designing no-sql databases should be access-oriented (versus datamodel oriented in sql): for often-used operations you should be getting data out as cheaply (= as few operations) as possible.
So, as a rule of thumb you should, in one operation, only get data that is needed at that moment (= shown on that page to user). I'm not sure about your app's design, but I doubt you need all user's full comments (with text and everything) at one time.

I'd start by saying you shouldn't apologize for having a Python background. App Engine started supporting only Python. Using the db module, you could have a User entity as the parent of several DailyCommentBatch entities each a parent of a couple Comment entities. IIRC, this will keep all related entities stored together (or close).
If you are using the NDB (I love it) you may have employ a StructuredProperty either at the User or DailyCommentBatch levels.

Related

Ancestor relation in datastore

I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user(Grand Parent) post(parent) comment(child)
I'm little bit confused about ancestors. I read from documention and searches that ancestors are used for transactions, every ancestors are in same entity group and entity groups are stored in same datastore node which makes it less scaleable. Is this right?
Is creating user as parent of posts and post as parent of comments a good thing?
Rather than this we can add one extra property in the post entity like user_id as shown in example and filter by it.
Which is better/more scalable: filter posts by ancestors or add an extra property user_id in the post Entity and filter by it?
I know both approaches can get the same results but I want to know which one is better in performance and scalability?
Sorry, I'm new in datastore.
Update 11/4/2017
A large number of users is using this App. It's is quite possible there are more
than one posts per sec. But A single user can not create posts more than one per sec. But multiple user may be. As described in documentations maximum entity group write rate of 1/s. Is it still possible to use Ancestor ?
Same for comments. Multiple user can add comment in a same entity group. It's is
quite possible more than one comment in one sec.
Ancestor Queries are faster ?
I read in many places that ancestors queries are much faster than others.
As I know the reason why they are fast is that because it create entity group and store related data in same node. So, it require less time to get data from single node as compare to multiple nodes.
For Example: If post is store in Asia node and comment is store in Europe node and I want to get posts and comments then datastore API need to fetch two nodes to complete request. Which make it slow. Rather than if I create ancestor relation and make entity group which create a better performance.
But what if I don't need to get post and comment data at same time. If I need post in separate web page and comment in separate page.In this scenario datastore api need to fetch only one node at a time.It is not matter data save in single node or save in multiple node. What about query performance can ancestor make it fast in this case ?

Yes, you are correct: all ancestry-related entities are in the same entity group, which raises 2 scalability issues: data contention and maximum entity group write rate of 1/s. See somehow related Is there an Entity Group Max Size?
There are advantages of using ancestries and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical to see every new user/post/comment in random searches immediately after it is created (i.e. strong consistency) - the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.

I think the question to ask is: Are you expecting:
User to create Posts more than once per seconds (I doubt :)
People to comment on a Post more than once per second (could happen)
It not, then having ancestors queries will be faster than normal queries. So it depends of your usecase. I'd go for query speed unless you know you will have thousands of comments on posts.

GAE Datastore: Normalization?

Normalization not in a general relational database sense, in this context.
I have received reports from a User. The data in these reports was generated roughly at the same time, making the timestamp the same for all reports gathered in one request.
I'm still pretty new to datastore, and I know you can query on properties, you have to grab the ancestors' entity's key to traverse down... so I'm wondering which one is better performance and "write/read/etc" wise.
Should I do:
Option 1:
User (Entity, ancestor of ReportBundle): general user information properties
ReportBundle (Entity, ancestor of Report): timestamp
Report (Entity): general data properties
Option 2:
User (Entity, ancestor of Report): insert general user information properties
Report (Entity): timestamp property AND general data properties

Do option 2:
Because, you save time for reading and writing an additional Entity.
You also save database operations (which in the end will save money).
As I see from your options, you need to check the timestamp property anyhow so putting it inside the report object would be fine,
also your code is less complex and better maintainable.
As mentioned from Chris and in comments, using datastore means thinking denormalized.
It's better to store the data twice then doing complex queries, goal for your data design should be to get the entities by ID.
Doing so will also save on the amount of indexes you may need. This is important to know.
The reason why the amount of indexes is limited, is because of denormalization.
For each index you create, datastore creates a new table in behind, which holds the data in the right order based on your index. So when you use indexes your data is already stored more then one time. The good thing about this behavior is that writes are faster, because you can write to all the index tables in parallel. Also reads, because you read data already in right order based on your index.
Knowing this, and if only these 2 options are available, option 2 would be the better one.

We have lots of very denormalized models because of the inability to do JOINs.
You should think about how you are going to process the data, if you might expect request timeouts.

Clarification: can I put all of a user's data in a single entity group by making up an ancestor key?

I want to do several operations on a user's data in a single transaction, but won't need to update multiple users' data in a single transaction. I see from http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html#Entity_Groups_Ancestors_and_Paths that "A good rule of thumb for entity groups is that [entity groups] should be about the size of a single user's worth of data or smaller," so I think the correct choice is to use a single parent key when building the keys for the other entities related to a user.
Does this seem like a good idea?
Is it easy to code? Something like KeyBuilder.setParent(theKeyOfMyUserEntity)?

1) It is hard to comment without some addition details about the data. There are several things you should be aware of with entity groups; the biggest is that the group will be stored together. That means if you are trying to do many (separate) updates you could face contention, limiting your app's performance.
2) yes it is easy to code. The syntax is pretty close to what you posted.
There are other options for transactions. Check out Nick Johnson's article on distributed transactions. If you are wanting transactions for aggregates you should also check out Brett Slatkin's IO talk on high-throughput data pipelines.

Yes, it seems reasonable to store some user data as child entities of a User entity.
Why do you need to manually create keys ? The db.Model() constructor already has a convenient "parent" argument which will automatically put both the parent entity and the child entity in the same entity group.

App Engine - Chain of entities generates exceptionally long entity keys

I am writing an application which allows users to send messages between them. I am using transactions to ensure that there is only a single "top" message between any two users, and this "top" message has a link to the "next" message, and so on.. forming a sort of of linked list of messages. The messages reference each other through reference properties, and are placed in the same entity group by declaring each new "top" a having the previous "top" as its parent.
However, the problem with this approach is that each new entity has a key that includes the entire key of the previous entity (ie: new_top_key == old_top_key + new_stuff). This results in entity keys growing at a large rate, and probably very bad behavior after a few hundred messages in a single chain (but I haven't actually tested).
So, my question is: 1) Is this an intentional feature of the App Engine. 2) Should I be avoiding this type of a structure -- or is it somehow efficiently dealt with by the App Engine internally? 3) Do you have any suggestions on what is the correct approach for a linked-list-of-entities type of structure?
Thank you and kind regards
Alex

In order:
Yes. Each entity is uniquely identified by its kind, key or id, and those of all its parents, which means that the entire chain is necessary to identify an entity.
Yes. Instead, have a "conversation" entity (which could be the first message, as well), which is a direct parent of all the posts. If you still need to maintain parent/child relationships within a conversation (instead of just ordering them by timestamp, for example), declare an explicit SelfReferenceProperty.
See #2, above.

Are you using python or java? The detailed answer will depend a bit on which API you are using.
I'm pretty sure that having your keys grow indefinitely is not the best plan. (it might be a good test case for the app engine api though :)
I think the solution will be to separate the entity group information from the message linking information. In order to do transactions on a thread/conversation/chain/whatever, all your messages need to be in the same entity group. However, they do not need to be in a hierarchy that exactly matches the structure of the links between messages. You should explicitly set the parent (entity group) of all your message entities to be the same, in a flat structure. So each entity would be a sibling of the others, in the sense of entity groups. You would also need a field in your entity to link to the next (and/or previous) message. So you would still have a linked list (or tree or whatever) in terms of the "previous message" links.
Both Python and Java have methods for creating an entity with a specific parent/entity group. (In fact, you can even specify a nonexistent entity to be the root of an entity group hierarchy!)
Now the key of each message will be a fixed length, so your "next" and "previous" reference properties will be nice and safe from overflowing some limit on key length.

Creating a Notifications type feed in GAE Objectify

I'm working on a notification feed for my mobile app and am looking for some help on an issue.
The app is a Twitter/Facebook like app where users can post statuses and other users can like, comment, or subscribe to them.
One thing I want to have in my app is to have a notifications feed where users can see who liked/comment on their post or subscribed to them.
The first part of this system I have figured out, when a user likes/comments/subscribes, a Notification entity will be written to the datastore with details about the event. To show a users Notification's all I have to do is query for all Notification's for that user, sort by date created desc and we have a nice little feed of actions other users took on a specific users account.
The issue I have is what to do when someone unlikes a post, unsubscribes or deletes a comment. Currently, if I were to query for that specific notification, it is possible that nothing would return from the datastore because of eventual consistency. We could imagine someone liking, then immediate unliking a post (b/c who hasn't done that? =P). The query to find that Notification might return null and nothing would get deleted when calling ofy().delete().entity(notification).now(); And now the user has a notification in their feed saying Sally liked his post when in reality she liked then quickly unliked it!
A wrench in this whole system is that I cannot delete by Key<Notification>, because I don't really have a way to know id of the Notification when trying to delete it.
A potential solution I am experimenting with is to not delete any Notifications. Instead I would always write Notification's and simply indicate if the notification was positive or negative. Then in my query to display notifications to a specific user, I could somehow only display the sum-positive Notification's. This would save some money on datastore too because deleting entities is expensive.

There are three main ways I've solved this problem before:
deterministic key
for example
{user-Id}-{post-id}-{liked-by} for likes
{user-id}-{post-id}-{comment-by}-{comment-index} for comments
This will work for most basic use cases for the problem you defined, but you'll have some hairy edge cases to figure out (like managing indexes of comments as they get edited and deleted). This will allow get and delete by key
parallel data structures
The idea here is to create more than one entity at a time in a transaction, but to make sure they have related keys. For example, when someone comments on a feed item, create a Comment entity, then create a CommentedOn entity which has the same ID, but make it have a parent key of the commenter user.
Then, you can make a strongly consistent query for the CommentedOn, and use the same id to do a get by key on the Comment. You can also just store a key, rather than having matching IDs if that's too hard. Having matching IDs in practice was easier each time I did this.
The main limitation of this approach is that you're effectively creating an index yourself out of entities, and while this can give you strongly consistent queries where you need them the throughput limitations of transactional writes can become harder to understand. You also need to manage state changes (like deletes) carefully.
State flags on entities
Assuming the Notification object just shows the user that something happened but links to another entity for the actual data, you could store a state flag (deleted, hidden, private etc) on that entity. Then listing your notifications would be a matter of loading the entities server side and filtering in code (or possibly subsequent filtered queries).
At the end of the day, the complexity of the solution should mirror the complexity of the problem. I would start with approach 3 then migrate to approach 2 when the fuller set of requirements is understood. It is a more robust and flexible approach, but complexity of XG transaction limitations will rear its head - but ultimately a distributed feed like this is a hard problem.

What I ended up doing and what worked for my specific model was that before creating a Notification Entity I would first allocate and ID for it:
// Allocate an ID for a Notification
final Key<Notification> notificationKey = factory().allocateId(Notification.class);
final Long notificationId = notificationKey.getId();
Then when creating my Like or Follow Entity, I would set the property Like.notificationId = notificationId; or Follow.notificationId = notificationId;
Then I would save both Entities.
Later, when I want to delete the Like or Follow I can do so and at the same time get the Id of the Notification, load the Notification by key (which is strongly consistent to do so), and delete it too.
Just another approach that may help someone =D

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight