Creating a Notifications type feed in GAE Objectify - google-app-engine

I'm working on a notification feed for my mobile app and am looking for some help on an issue.
The app is a Twitter/Facebook like app where users can post statuses and other users can like, comment, or subscribe to them.
One thing I want in my app is a notifications feed where users can see who liked/commented on their posts or subscribed to them.
The first part of this system I have figured out: when a user likes/comments/subscribes, a Notification entity is written to the datastore with details about the event. To show a user's notifications, all I have to do is query for all Notifications for that user, sorted by date created descending, and we have a nice little feed of actions other users took on a specific user's account.
The issue I have is what to do when someone unlikes a post, unsubscribes, or deletes a comment. Currently, if I were to query for that specific Notification, it is possible that nothing would be returned from the datastore because of eventual consistency. We could imagine someone liking, then immediately unliking a post (b/c who hasn't done that? =P). The query to find that Notification might return null, and nothing would get deleted when calling ofy().delete().entity(notification).now(); And now the user has a notification in their feed saying Sally liked his post when in reality she liked then quickly unliked it!
A wrench in this whole system is that I cannot delete by Key<Notification>, because I don't really have a way to know the ID of the Notification when trying to delete it.
A potential solution I am experimenting with is to not delete any Notifications. Instead I would always write Notifications and simply indicate whether the notification was positive or negative. Then in my query to display notifications to a specific user, I could somehow display only the sum-positive Notifications. This would save some money on the datastore too, because deleting entities is expensive.

There are three main ways I've solved this problem before:
deterministic key
For example:
{user-id}-{post-id}-{liked-by} for likes
{user-id}-{post-id}-{comment-by}-{comment-index} for comments
This will work for most basic use cases for the problem you defined, but you'll have some hairy edge cases to figure out (like managing the indexes of comments as they get edited and deleted). This allows get and delete by key.
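As a minimal sketch (assuming Objectify, a Notification entity keyed by a String @Id, and illustrative variable names), the deterministic key lets you skip the query entirely:
// Build the same key on like and on unlike -- no query needed.
final String id = userId + "-" + postId + "-" + likedById;
final Key<Notification> key = Key.create(Notification.class, id);
// On like: save under the deterministic key.
ofy().save().entity(new Notification(id /* ...event details... */)).now();
// On unlike: deleting by key is strongly consistent, no query involved.
ofy().delete().key(key).now();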
parallel data structures
The idea here is to create more than one entity at a time in a transaction, but to make sure they have related keys. For example, when someone comments on a feed item, create a Comment entity, then create a CommentedOn entity which has the same ID, but give it a parent key of the commenting user.
Then, you can make a strongly consistent query for the CommentedOn, and use the same ID to do a get by key on the Comment. You can also just store a key, rather than having matching IDs, if that's too hard. In practice, having matching IDs was easier each time I did this.
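A rough sketch of that write path (assuming Objectify 5+ and illustrative constructors; the XG transaction spans the two entity groups):
// Allocate one ID and reuse it for both entities.
final long id = factory().allocateId(Comment.class).getId();
ofy().transact(() -> {
    final Comment comment = new Comment(id, postKey, text);
    // Same ID, but parented under the commenting user, so an
    // ancestor query on that user is strongly consistent.
    final CommentedOn commentedOn = new CommentedOn(id, commenterKey);
    ofy().save().entities(comment, commentedOn).now();
});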
The main limitation of this approach is that you're effectively creating an index yourself out of entities, and while this can give you strongly consistent queries where you need them, the throughput limitations of transactional writes can become harder to understand. You also need to manage state changes (like deletes) carefully.
State flags on entities
Assuming the Notification object just shows the user that something happened but links to another entity for the actual data, you could store a state flag (deleted, hidden, private, etc.) on that entity. Then listing your notifications would be a matter of loading the entities server-side and filtering in code (or possibly issuing subsequent filtered queries).
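For instance (a sketch; the state field and the getTargetKey/getState accessors are hypothetical, and null checks and batch loading are omitted):
// Hypothetical state flag on the referenced entity.
enum State { ACTIVE, DELETED, HIDDEN, PRIVATE }

final List<Notification> feed =
    ofy().load().type(Notification.class)
        .filter("userId", userId)
        .order("-created")
        .list();
// Filter server-side against the current state of each target entity.
feed.removeIf(n -> ofy().load().key(n.getTargetKey()).now()
        .getState() != State.ACTIVE);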
At the end of the day, the complexity of the solution should mirror the complexity of the problem. I would start with approach 3, then migrate to approach 2 once the fuller set of requirements is understood. It is a more robust and flexible approach, but the complexity of XG transaction limitations will rear its head; ultimately, a distributed feed like this is a hard problem.

What I ended up doing, and what worked for my specific model, was that before creating a Notification entity I would first allocate an ID for it:
// Allocate an ID for a Notification
final Key<Notification> notificationKey = factory().allocateId(Notification.class);
final Long notificationId = notificationKey.getId();
Then when creating my Like or Follow Entity, I would set the property Like.notificationId = notificationId; or Follow.notificationId = notificationId;
Then I would save both Entities.
Later, when I want to delete the Like or Follow, I can do so and at the same time get the ID of the Notification, load the Notification by key (which is strongly consistent), and delete it too.
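The delete path then looks something like this (a sketch; field names are illustrative):
// Load the Like, recover the Notification's ID, delete both by key.
final Like like = ofy().load().type(Like.class).id(likeId).now();
final Key<Notification> notificationKey =
    Key.create(Notification.class, like.notificationId);
// Key-based deletes need no query, so eventual consistency is not an issue.
ofy().delete().entity(like).now();
ofy().delete().key(notificationKey).now();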
Just another approach that may help someone =D

Related

Ancestor relation in datastore

I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user (grandparent) -> post (parent) -> comment (child)
I'm a little bit confused about ancestors. I read in the documentation and in searches that ancestors are used for transactions, that all entities under an ancestor are in the same entity group, and that an entity group is stored on the same datastore node, which makes it less scalable. Is this right?
Is creating user as parent of posts and post as parent of comments a good thing?
Rather than this, we could add one extra property to the post entity, like user_id, as shown in the example, and filter by it.
Which is better/more scalable: filtering posts by ancestor, or adding an extra user_id property to the post entity and filtering by it?
I know both approaches can get the same results, but I want to know which one is better in performance and scalability.
Sorry, I'm new to the datastore.
Update 11/4/2017
A large number of users is using this app. It is quite possible that there is more than one post per second overall, but a single user cannot create more than one post per second; multiple users can. As described in the documentation, the maximum entity-group write rate is 1/s. Is it still possible to use ancestors?
Same for comments: multiple users can add comments in the same entity group, so it is quite possible to get more than one comment in one second.
Are ancestor queries faster?
I have read in many places that ancestor queries are much faster than others.
As I understand it, the reason they are fast is that an entity group stores related data on the same node, so it takes less time to get data from a single node than from multiple nodes.
For example: if a post is stored on a node in Asia and a comment is stored on a node in Europe, and I want to get posts and comments, then the datastore API needs to reach two nodes to complete the request, which makes it slow. If instead I create an ancestor relation and form an entity group, I get better performance.
But what if I don't need to get post and comment data at the same time, e.g. posts on one web page and comments on another? In that scenario the datastore API needs to reach only one node at a time, so it doesn't matter whether the data is saved on a single node or spread across multiple nodes. Can ancestors still make queries faster in this case?
Yes, you are correct: all entities related by ancestry are in the same entity group, which raises 2 scalability issues: data contention and the maximum entity-group write rate of 1/s. See the somewhat related Is there an Entity Group Max Size?
There are advantages to using ancestors, and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical for every new user/post/comment to appear in random searches immediately after it is created (i.e. strong consistency); the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.
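A sketch of that ancestor-free modeling with Objectify (annotations as in the library; field names are illustrative):
@Entity
public class Post {
    @Id Long id;
    @Index Long userId; // plain cross-reference instead of a parent key
}

// Eventually consistent, but free of the 1 write/sec entity-group limit:
final List<Post> posts =
    ofy().load().type(Post.class).filter("userId", userId).list();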
I think the question to ask is: are you expecting
a user to create posts more than once per second? (I doubt it :)
people to comment on a post more than once per second? (could happen)
If not, then ancestor queries will be faster than normal queries. So it depends on your use case. I'd go for query speed unless you know you will have thousands of comments on posts.

How to implement user feed like in Twitter or Facebook on redis

I'm going to write a simple news site on Redis with support for followers.
I can't imagine how to organize user timelines like Twitter's. I read about Retwis (http://redis.io/topics/twitter-clone), but its feed-creation method seems stupid. What if I want to remove entries? I would have to remove all references to the entry from followers' feeds. What if I already do not follow some users?
There are several ways to attack what you describe using a bit of imagination; here are some examples that address your questions:
What if I want to remove entries?
One could maintain a set such as post:$postid:users for each post, holding all the user IDs that may have the post in their feeds. When the post is to be deleted, one just has to extract all members from this set and iterate through the IDs to remove the post from each uid:$userid:posts set. Speaking of which, you would have to turn that last one into a set instead of a list like the original article suggests, in order to be able to extract and remove individual items, but that is trivial; the logic is pretty similar.
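A sketch with the Jedis client, using the key scheme above (assuming uid:$userid:posts has already been made a set):
// Remove a deleted post from every feed that may contain it.
final Jedis jedis = new Jedis("localhost");
final String postUsersKey = "post:" + postId + ":users";
for (final String uid : jedis.smembers(postUsersKey)) {
    jedis.srem("uid:" + uid + ":posts", postId);
}
// Clean up the post itself and its fan-out set.
jedis.del(postUsersKey, "post:" + postId);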
What if I already do not follow some users?
When the feed is being generated for an individual user, you necessarily iterate and read each post:$postid key, from which you have access to the author's user ID; so before showing the post, you read this ID and look it up in the uid:$userid:following set. If it's there, you show the post; if it's not, you delete the post from uid:$userid:posts and don't show it.
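Sketched the same way (here assuming, unlike Retwis's plain post strings, that each post is a hash with a userid field; renderPost is a hypothetical hook):
// Lazily drop posts whose authors the user no longer follows.
for (final String pid : jedis.smembers("uid:" + userId + ":posts")) {
    final String authorId = jedis.hget("post:" + pid, "userid");
    if (jedis.sismember("uid:" + userId + ":following", authorId)) {
        renderPost(pid); // show it
    } else {
        jedis.srem("uid:" + userId + ":posts", pid); // clean up lazily
    }
}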
In a nutshell, this is what you have to keep in mind in order to build this kind of logic in Redis:
You'll need many commands, but that's OK; Redis is supposed to be fast enough to handle it well.
Data will repeat, but that is also OK. It may look insane to someone with a relational DBMS background to store a set of users for each post when each user already has a set of their posts, but this is the only way to build relationships in a non-relational data store like Redis.
Generally speaking, think of sets and sorted sets when designing something relational in Redis.
With Redis you get to do everything yourself, but once you get your head around it, it's actually pretty powerful.

Delete data or just flag it as deleted?

I'm building a website that lets people create vocabulary lessons. When a lesson is created, a news item is created that references the lesson. When another user practices the lesson, the user also stores a reference to it together with the practice result.
My question is what to do when a user decides to remove the lesson?
The options I've considered are:
1. Actually delete the lesson from the database and remove all referencing news items, practice results, etc.
2. Just flag it as deleted and exclude the link from referencing news items, results, etc.
What are your thoughts? Should data never be removed, à la Facebook? Should references be avoided altogether?
By the way, I'm using Google App Engine (Python/datastore). A db.ReferenceProperty is not set to None when the referenced object is deleted, as far as I can see?
Thanks!
Where changes to data need to be audited, marking data as deleted (aka "soft deletes") helps greatly, particularly if you record the user that actioned the delete and the time when it occurred. It also allows data to be "un-deleted" very easily.
Having said that, there is no reason to prevent "hard deletes" (where data is actually deleted) as an administrative function to help tidy up mistakes.
Marking the data as "deleted" is simplest. If you currently have no use for it, this keeps everything in your database very tidy and makes it easy to add new functionality.
On the other hand, if you're doing something like showing the user where their "vocabulary points" came from, or how many lessons they've completed, then the reference to soft deleted items might be necessary.
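The question is Python/datastore, but sketched in Java/Objectify for consistency with the rest of this page (field names are illustrative), the flag approach is just:
@Entity
public class Lesson {
    @Id Long id;
    @Index boolean deleted; // soft-delete flag
    Date deletedAt;         // audit trail: when...
    String deletedBy;       // ...and by whom
}

// Listings simply exclude soft-deleted lessons:
final List<Lesson> lessons =
    ofy().load().type(Lesson.class).filter("deleted", false).list();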
I'd start with the first one and change it later if you need to. Here's why:
If you build soft deletes without a current use for them, assume they won't work in the way that future requirements actually want them to. You'll have to rewrite them anyway.
If you are using them, assume that nobody is using the feature which depends on them. Now you've done a lot of work and tied yourself into maintaining something nobody cares about.
If you create them, you'll find yourself creating a feature to use them. See the above.
If you don't create them, you can always create them later, once you have better knowledge about what the users of your system really want.
Not creating soft deletes gives you more options going forward. Options have value. Options expire. Never commit early unless you know why.

Viewing a "log" of new (distinct) events in a database

I have an application which has several unrelated tables in its DB. I'll explain by using an "auto-updating" version of the SO homepage as an example, so let's say I have the tables "users", "comments" and "questions".
The homepage client side needs to periodically poll the server, and get a log of all the new "events" that have happened. I.e., I'd like to display (somehow) the new questions, comments and users that have been added to SO since the last poll.
One way would be to simply keep a variable on the client side containing the last index of each of my tables, send it to the server, and have the server send me the new users, comments, and questions.
The problem is, what happens when I add a new type of information, say, votes. Now I have to store another variable on the client-side, and the server has to poll another table. And so on, for every new type of information I keep.
I'm looking for a solution that helps me avoid this.
Another problem - say I'd like to see all the "events" that have happened since last time, but sorted according to when they took place.
One direction I had is to have a single "events" table, which contains the info about when each event happened. I can then poll only this table, and get a list of all the new events that have happened. The problem is that each event is pretty different (a new comment has different columns than a new upvote, etc.) So I'm not sure how to implement this, or if this is even a good idea.
Does anybody have any ideas how I can solve this? This seems like something that would come up a lot, but I don't really have much experience with databases, unfortunately.
Thanks!
It sounds to me like you're trying to future-proof via database design. While this can be done through something like an EAV model, I caution against that because the value it adds tends not to be worth the cost.
Instead, you should model the database as closely to reality as possible, not according to how you intend to use it.
Then use SQL to project the data into the shape you need. You can do this with statements that deliver the metadata you need, e.g.:
SELECT COUNT(ID), 'Comments' AS Type
FROM Comments
WHERE lastUpdate > #InputParameter1#
UNION
SELECT COUNT(ID), 'Questions' AS Type
FROM Questions
WHERE lastUpdate > #InputParameter1#
Or (and this doesn't get used often enough) return more than one result set from your database in one go:
SELECT userid, CommentText
FROM Comments
WHERE lastUpdate > #InputParameter1#;

SELECT userId, Questions, Tags
FROM Questions
WHERE lastUpdate > #InputParameter1#
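Consuming both result sets from a single call looks roughly like this in JDBC (a sketch; multi-statement batches are driver-dependent, the date literal is illustrative, and connection is assumed to be an open java.sql.Connection):
try (Statement stmt = connection.createStatement()) {
    // Two statements, one round trip (if the driver allows it).
    boolean isResultSet = stmt.execute(
        "SELECT userid, CommentText FROM Comments WHERE lastUpdate > '2017-01-01'; "
      + "SELECT userId, Questions, Tags FROM Questions WHERE lastUpdate > '2017-01-01'");
    while (isResultSet || stmt.getUpdateCount() != -1) {
        if (isResultSet) {
            try (ResultSet rs = stmt.getResultSet()) {
                while (rs.next()) {
                    // handle one row of the current result set
                }
            }
        }
        isResultSet = stmt.getMoreResults();
    }
}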
That said, you will still have to write some code if you add new stuff, but it should be limited to updating your SQL, adding new containers for your data, and then code to display it to the end users and to validate and store it.
Honestly, the idea that adding new stuff requires some work doesn't seem that awful to me.

There is probably a name for this. Please re-title appropriately

I'm evaluating the idea of building a set of generic database tables that will persist user input. There will then be a secondary process to kick off a workflow and process the input.
The idea is that the notion of saving the initial user input is separate from processing and putting it into the structured schema for a particular application.
An example might be some sort of job application or quiz with open-ended questions. The raw answers will not be super valuable to us for aggregate reporting without some human classification. But, we do want to store the raw input as a historical record.
We may also want the user to be able to partially fill out some information and have it persisted until he returns.
Processing all the input to the point where we can put it into our application-specific data schema may not be possible until we have ALL the data.
Two initial questions:
Assuming this concept has a name, what is it?
Is this a reasonable approach? Why or why not?
Update:
Here's another way to state the idea. The user is sequentially populating fields in a DTO. I (think I) want to save the DTO to disk even in a partially-complete state. Once the user has completed populating the fields, I want to pull out the DTO and process it for structured saving into a table which represents the specific DTO. I can't, however, save a partially complete or (worse) a temporarily incorrect set of input since some of the input really shouldn't be stored as part of the structured record.
My idea is to create some generic way to save any type of DTO and then pull them out for processing in a specific app as needed. So maybe this generic DTO table stores data relating to customer satisfaction surveys right next to questions answered in a new account setup wizard.
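One possible shape for that generic holding store, sketched as an Objectify entity for concreteness (all names are illustrative; JSON serialization is an assumption):
@Entity
public class PendingDto {
    @Id Long id;
    @Index String dtoType;   // e.g. "SurveyResponse", "AccountSetupWizard"
    String payloadJson;      // the partially filled DTO, serialized as JSON
    @Index boolean complete; // set once every field has been populated
}

// A worker later loads complete DTOs, deserializes payloadJson, validates
// the contents, and writes them into the application-specific schema.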
You stated:
My idea is to create some generic way to save any type of DTO and then pull them out for processing in a specific app as needed.
I think you're one level of abstraction off. I would argue that the entire database is fulfilling the role you want a limited set of tables to perform. You could create some kind of complicated storage schema that wouldn't represent the data in any way, and then (slowly and painfully, from the DBMS's perspective) merge and render a view of the data... but I would suggest that this is an over-engineered solution.
I've written several applications where, because of custom user requirements, a (sometimes significant) portion of the application is dynamic - constructed by the user, from the schema to the business rules. The ones that manufactured their storage schemas by executing statements like CREATE TABLE and ALTER TABLE were, surprisingly, the ones easiest to maintain. They also allow users to create reports in a very straightforward, expected way.
Sounds like you're initially storing the data in a normalized form (generic), and once you have the complete set you are denormalizing it (structured schema).
You might be speaking about workflow; check out Windows Workflow.
The concept of workflow is that it mirrors the processes of real life. That is to say, you complete a document, but the document is not complete until it has been approved. In your case, that would be: data is entered but unclassified, so it is stored in the database (dehydrated) and a flag is raised for whoever needs to deal with the issue. It can persist in this state for as long as necessary. Once someone is able to deal with it, the workflow is kicked off again (hydrated) and continues to the next steps.
Here are some SO questions regarding workflows:
This question: "Is it better to have one big workflow or several smaller specific ones?" clears up some of the ways that workflow can be used, and also highlights some issues with it.
John Saunders has a very good breakdown of what workflow is good for in this question.
