How to handle achievements/badges/awards for your app with minimum hit to the system? - badge

I like the concept of badges and achievements for a website I am designing. They have been proven to improve usage/utilization rates and I think could be a large motivator for an app I'd like to develop.
At a high level I can think of 3 ways to do this.
Check for members who meet requirements as a cron job: This doesn't seem like a good idea to me, as the membership grows, the cron job would take longer and longer to do.
Every time an action is completed that could meet the requirements for a badge, check to see if any badges should be awarded: This seems like a good way to do it, but it seems like I could potentially pound the server continuously checking on badges that have already been awarded or that the user may not even be close to.
Every time the user completes an action that could earn a badge, check whether they already have it, then check whether they meet the requirements: This seems alright as well, but if I'm storing the user as an object, it seems like it could get prohibitively large, or that I may end up hitting the database pretty hard checking for achievements all the time.
Are there any options I'm missing? Are my concerns for one or more approaches overblown?
Edit:
Is this a far less interesting question than I thought it was, or did I ask at a bad time? Did I leave something unclear?

Or combine two of your ideas:
Every time the user completes an action that could earn a badge, put the user in a list (if they are not there already) and process this list frequently using cron.
This way you do not have to check each time the user completes an action, and the cron job stays reasonable because it only looks at users with recent activity.
Of course there are variants, such as processing the list once it reaches a certain size, or partially checking the requirements before adding the user to the list.
I suppose this would depend on the amount of users, the available actions that can be completed, etc.
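A minimal sketch of the "flag the user, check later" idea, assuming in-memory stand-ins for what would really be database tables. The names pending_users, awarded, record_action and process_pending are hypothetical, and the badge rules are passed in as plain functions:

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical in-memory stand-ins for real tables.
pending_users: Set[int] = set()      # users with badge-relevant activity since the last run
awarded: Dict[int, Set[str]] = {}    # user_id -> badges the user already holds


def record_action(user_id: int) -> None:
    """Called on every badge-relevant action; no badge checks happen here."""
    pending_users.add(user_id)       # a set deduplicates repeated actions


def process_pending(rules: Dict[str, Callable[[int], bool]]) -> List[Tuple[int, str]]:
    """Run from cron: check only flagged users, and only badges they lack."""
    newly_awarded: List[Tuple[int, str]] = []
    while pending_users:
        user_id = pending_users.pop()
        held = awarded.setdefault(user_id, set())
        for badge, is_earned in rules.items():
            if badge not in held and is_earned(user_id):
                held.add(badge)
                newly_awarded.append((user_id, badge))
    return newly_awarded
```

The cost of each cron run then scales with the number of users who were active since the last run, not with total membership.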

Related

Customer Deduplication in Booking Application

We have a booking system where tens of thousands of reservations are made every day. Because a customer can create a reservation without being logged in, a new customer id/row is created for every reservation, even if the very same customer has already made a reservation in the system before. That results in a lot of customer duplicates.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script which checks for these duplicates based on some business rules (email, address, etc). The logic for the deduplication then is:
If a new reservation is created, check if the (newly created) customer for this reservation has already an old customer id (by comparing email and other aspects).
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
I don't have a very strong technical background, but to me this smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Besides that, I was hoping to understand why exactly, in terms of application architecture, this is bad design and what would be a better solution for this problem of deduplication (if it even has to be solved in "this" application domain).
I would very much appreciate any help so I can steer the engineering team in the right direction.
In General
What's the problem you're trying to solve? Free-up disk space, get accurate analytics of user behavior or be more user friendly?
It feels a bit risky, and depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse" - not because you should be paranoid, but because to not think that through feels a bit negligent. E.g. if you were a govt department matching private citizen records then that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job - If your backend system is an old legacy system then I can appreciate why a nightly batch job might be the easiest option; that said, if done correctly and with the right architecture, you should be able to do that check on the fly as needed.
Specifics
"...check if the (newly created) customer for this reservation has already an old customer id (by comparing email..."
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
"... and other aspects."
What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone does a typo on the phone number which matches with some other customer - so you simultaneously match with more than one customer?
"If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer."
Feels dangerous. What happens if the detaching process screws up? I've seen situations where instead of updating the delta, the system did a total purge then full re-import... when the second part fails the entire system is blank. It's not your exact situation but you are creating the possibility for similar types of issue.
"As we have several operational applications relying on that data, this creates a massive sync issue."
...case in point.
In your case, doing the swap in a transaction would be wise. You may want to consider tracking all customer ID swaps so that you can revert if something goes wrong.
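As a rough illustration of "swap in a transaction and keep a log you can revert from", here is a sketch using SQLite. The table and column names (reservation, customer_swap_log) and the function names are assumptions for the example, not the real schema:

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal schema for the illustration.
    conn.executescript("""
        CREATE TABLE reservation (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
        CREATE TABLE customer_swap_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            reservation_id INTEGER NOT NULL,
            old_customer_id INTEGER NOT NULL,
            new_customer_id INTEGER NOT NULL
        );
    """)


def reassign_reservations(conn: sqlite3.Connection, old_id: int, new_id: int) -> int:
    """Move reservations from old_id to new_id and log each move, atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        rows = conn.execute(
            "SELECT id FROM reservation WHERE customer_id = ?", (old_id,)
        ).fetchall()
        conn.executemany(
            "INSERT INTO customer_swap_log (reservation_id, old_customer_id, new_customer_id) "
            "VALUES (?, ?, ?)",
            [(row[0], old_id, new_id) for row in rows],
        )
        conn.execute(
            "UPDATE reservation SET customer_id = ? WHERE customer_id = ?", (new_id, old_id)
        )
    return len(rows)


def revert_swaps(conn: sqlite3.Connection) -> None:
    """Undo every logged swap, most recent first, in one transaction."""
    with conn:
        log = conn.execute(
            "SELECT id, reservation_id, old_customer_id FROM customer_swap_log ORDER BY id DESC"
        ).fetchall()
        for log_id, reservation_id, old_customer_id in log:
            conn.execute(
                "UPDATE reservation SET customer_id = ? WHERE id = ?",
                (old_customer_id, reservation_id),
            )
            conn.execute("DELETE FROM customer_swap_log WHERE id = ?", (log_id,))
```

Because the log rows and the update commit together, a failure partway through leaves neither a half-moved reservation nor an orphaned log entry.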
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records, just make a copy that is what the new data would be. Do this in production - you'll get a way better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What's more likely, and what you could consider building, is a "scoring" algorithm. If you are checking more than one piece of data then you'll get different combinations with different likelihood of accuracy. You can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not.
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
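The "scoring" idea from the steps above could look something like the sketch below. The fields, weights and thresholds are purely hypothetical; the point of the trial data is to find out which values are safe for your system:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Customer:
    email: Optional[str]
    phone: Optional[str]
    postcode: Optional[str]


# Hypothetical integer weights (out of 100); tune them against the trial data.
WEIGHTS = {"email": 60, "phone": 30, "postcode": 10}


def _normalise(value: Optional[str]) -> Optional[str]:
    """Trim and lowercase; treat missing or blank values as 'no data'."""
    cleaned = value.strip().lower() if value else ""
    return cleaned or None


def match_score(a: Customer, b: Customer) -> int:
    """Sum the weights of the fields on which both records agree."""
    score = 0
    for field, weight in WEIGHTS.items():
        left = _normalise(getattr(a, field))
        right = _normalise(getattr(b, field))
        if left is not None and left == right:
            score += weight
    return score


def decide(score: int, merge_at: int = 90, review_at: int = 60) -> str:
    """Map a score to an action; only high scores trigger an automatic ID switch."""
    if score >= merge_at:
        return "merge"
    if score >= review_at:
        return "review"
    return "keep_separate"
```

Logging the score alongside each trial decision makes it possible to see, per combination of matching fields, how often the match was actually right.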
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.
I wouldn't say that's a terrible design; it's just a simple approach to solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the number of new bookings received during the day, which may vary from day to day, so other workflows that depend on it will be impacted.
This approach can be improved by processing new bookings in parallel, and using an index to get a fast lookup when checking if a new e-mail already exists or not.
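You can also check out Bloom filters - a space-efficient probabilistic data structure that can tell you for certain that an element is not in a given set (a "maybe present" answer can be a false positive and still needs a real lookup).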
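For a feel of how a Bloom filter behaves, here is a small self-contained sketch. The bit-array size, number of hashes and use of SHA-256 are arbitrary choices for the example; a production system would more likely use an existing library and size the filter for its expected volume:

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Set-membership filter: 'no' is definite, 'yes' only means 'maybe'."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means the item was never added; True must be confirmed elsewhere.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

In the deduplication job, a new email for which might_contain returns False can skip the database lookup entirely; only the "maybe" cases need the indexed query.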
The way I would do it is to store the bookings in a NoSQL table keyed on the user's email. You get the email in both situations - whether the user has an account or makes a booking without one - so you just have to do a lookup to get the bookings by email, which makes that deduplication job redundant.

Is this a functional syncing algorithm?

I'm working on a basic syncing algorithm for a user's notes. I've got most of it figured out, but before I start programming it, I want to run it by here to see if it makes sense. Usually I end up not realizing one huge important thing that someone else easily saw that I couldn't. Here's how it works:
I have a table in my database where I insert objects called SyncOperation. A SyncOperation is a sort of metadata on the nature of what every device needs to perform to be up to date. Say a user has 2 registered devices, firstDevice and secondDevice. firstDevice creates a new note and pushes it to the server. Now, a SyncOperation is created with the note's Id, operation type, and processedDeviceList. I create a SyncOperation with type "NewNote", and I add the originating device ID to that SyncOperation's processedDeviceList. So now secondDevice checks in to the server to see if it needs to make any updates. It makes a query to get all SyncOperations where secondDeviceId is not in the processedDeviceList. It finds out its type is NewNote, so it gets the new note and adds itself to the processedDeviceList. Now this device is in sync.
When I delete a note, I find the already created SyncOperation in the table with type "NewNote". I change the type to Delete and remove all devices from processedDeviceList except for the device that deleted the note. So now when other devices call in to see what they need to update, since their deviceId is not in the processedDeviceList, they'll have to process that SyncOperation, which tells their device to delete that respective note.
And that's generally how it'd work. Is my solution too complicated? Can it be simplified? Can anyone think of a situation where this wouldn't work? Will this be inefficient on a large scale?
Sounds very complicated - the central database shouldn't be responsible for determining which devices have received which updates. Here's how I'd do it:
The database keeps a table of SyncOperations, one for each change. Each SyncOperation has a change_id numbered in ascending order (that is, change_id INTEGER PRIMARY KEY AUTOINCREMENT).
Each device keeps a current_change_id number representing what change it last saw.
When a device wants to update, it does SELECT * FROM SyncOperations WHERE change_id > current_change_id. This gets it the list of all changes it needs to be up-to-date. Apply each of them in chronological order.
This has the charming feature that, if you wanted to, you could initialise a new device simply by creating a new client with current_change_id = 0. Then it would pull in all updates.
Note that this won't really work if two users can be doing concurrent edits (which edit "wins"?). You can try and merge edits automatically, or you can raise a notification to the user. If you want some inspiration, look at the operation of the git version control system (or Mercurial, or CVS...) for conflicting edits.
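The change_id scheme described above can be sketched with SQLite as follows. The SyncOperations table, change_id and current_change_id come from the answer; the other columns, the Device class and the helper names are assumptions for the illustration, and conflict handling is deliberately left out:

```python
import sqlite3
from typing import Dict, Optional


def create_server_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SyncOperations ("
        " change_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " note_id TEXT NOT NULL,"
        " op_type TEXT NOT NULL,"   # e.g. 'NewNote' or 'Delete'
        " body TEXT)"
    )
    return conn


def record_change(conn: sqlite3.Connection, note_id: str, op_type: str,
                  body: Optional[str] = None) -> int:
    """Append one change to the log and return its change_id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO SyncOperations (note_id, op_type, body) VALUES (?, ?, ?)",
            (note_id, op_type, body),
        )
    return cur.lastrowid


class Device:
    """A client that remembers only the last change it has applied."""

    def __init__(self) -> None:
        self.current_change_id = 0          # 0 means 'has seen nothing yet'
        self.notes: Dict[str, Optional[str]] = {}

    def pull(self, conn: sqlite3.Connection) -> int:
        """Fetch and apply all changes newer than current_change_id, in order."""
        rows = conn.execute(
            "SELECT change_id, note_id, op_type, body FROM SyncOperations"
            " WHERE change_id > ? ORDER BY change_id",
            (self.current_change_id,),
        ).fetchall()
        for change_id, note_id, op_type, body in rows:
            if op_type == "Delete":
                self.notes.pop(note_id, None)
            else:
                self.notes[note_id] = body
            self.current_change_id = change_id
        return len(rows)
```

The server never tracks devices at all: a brand-new Device starts at 0 and replays the whole log, while an existing one only sees rows after its bookmark.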
You may want to take a look at SyncML for ideas on how to handle sync operations (http://www.openmobilealliance.org/tech/affiliates/syncml/syncml_sync_protocol_v11_20020215.pdf). SyncML has been around for a while, and as a public standard, has had a fair amount of scrutiny and review. There are also open source implementations (Funambol comes to mind) that can also provide some coding clues. You don't have to use the whole spec, but reading it may give you a few "ahah" moments about syncing data - I know it helped to think through what needs to be done.
Mark
P.S. A later version of the protocol - http://www.openmobilealliance.org/technical/release_program/docs/DS/V1_2_1-20070810-A/OMA-TS-DS_Protocol-V1_2_1-20070810-A.pdf
I have seen the basic idea of keeping track of operations in a database elsewhere, so I dare say it can be made to work. You may wish to think about what should happen if different devices are in use at much the same time, and end up submitting conflicting changes - e.g. two different attempts to edit the same note. This may surface as a change to the user interface, to allow them to intervene to resolve such conflicts manually.

Is it highly necessary to record the registration date of new website users?

What are the advantages and disadvantages?
That depends on what your site is, and how you use that information. On StackOverflow, you are awarded a "yearling" badge once a full year elapses from the time you registered. Clearly here that information is necessary.
If I were you, I'd save it. It's a small piece of information that may become useful eventually. It's better to have it and not need it than to need it and not have it. It would be rather difficult to extrapolate an accurate registration date retrospectively if you don't store it to begin with.
Advantage:
You avoid a migration horror when you need it at some point. For a lot of records you cannot reconstruct this date afterwards. You could fake it with MODIFICATION_DATE, but that is often inaccurate and later than the real registration date (e.g. when the profile can be edited by the user).
Disadvantage:
In case you never need this information, you have wasted space (though one more small column shouldn't be a problem). Furthermore, you have a permanently unused field, which can be confusing to new developers ("what is this column for, cannot see where it is used...?")
As mentioned, the registration date is most likely valuable information, so I would add it from the start. When thinking about persistent data and its model you sometimes have to think "more" for the future.

When creating a social voting system, should you keep track of downvotes and upvotes separately in the DB?

With things like SO, Digg, Reddit, etc...
Should one keep track of downvotes in the database independent of upvotes? Or should they simply have a "votes" field that is decremented/incremented based off what the user does with no persisting of that?
How should votes be handled?
On SO, up votes earn +10, down votes -2. For this to work they need to be tracked separately. It's quite possible for a controversial answer to generate a few of each, and just showing an aggregate number won't mean much. So I'd say keep them separate.
I would keep them separate. Some questions have a lot of activity (up and down) and you really like to identify those.
Even if you are not interested in the difference right now, an extra field in the table is not that expensive, so it does not hurt to separate it. Because if you want to add it later, there is no way you can retrieve the data if it is not stored separately.
I also assume SO keeps separate votes for CW and non CW entries. Because if the question changes to CW later on, even with a recalc, the original gained/lost rep is kept.
Depends what you wish to do with your data.
If you only want to display votes then I'd say use only one field. It's like the number of views of a thread on a forum: you want to see what gets the most clicks, but not how many times each person viewed it.
Voting system on SO is a bit more complex. Since they can cancel all votes from particular user they have to keep track of who voted for/against what. This, I think, is written in another table, but because it is expensive to recalculate all votes every time someone views a question, they keep calculated value in a field, changing it whenever someone votes.
I can advise you to store them separately maybe even with extra data who authored a particular up or downvote. Who knows, you may come up with a nice idea tomorrow and you will need this additional data to implement it.
But it would also be good to have a sort of pre-calculated field (let's call it a cache) which is updated whenever an up- or downvote is submitted. The pages will then be rendered from this precalculated field. This will improve response time and lessen the load on the DB.
If it is too costly to recalculate values immediately, you may consider running a scheduled task (once per hour?) which processes the latest votes and recalculates the cached values.
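A sketch of this "individual votes plus cached counters" layout, using SQLite. The post and vote tables and all function names are hypothetical; here the cached counters are refreshed for the affected post on each vote, and the same refresh_counts function could instead be called from a scheduled task:

```python
import sqlite3
from typing import Tuple


def create_vote_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE post (
            id INTEGER PRIMARY KEY,
            upvotes INTEGER NOT NULL DEFAULT 0,    -- cached count
            downvotes INTEGER NOT NULL DEFAULT 0   -- cached count
        );
        CREATE TABLE vote (
            post_id INTEGER NOT NULL,
            user_id INTEGER NOT NULL,
            value INTEGER NOT NULL CHECK (value IN (-1, 1)),
            PRIMARY KEY (post_id, user_id)         -- one vote per user per post
        );
    """)
    return conn


def refresh_counts(conn: sqlite3.Connection, post_id: int) -> None:
    """Recompute the cached counters for one post from the vote rows."""
    conn.execute(
        "UPDATE post SET"
        " upvotes = (SELECT COUNT(*) FROM vote WHERE post_id = ? AND value = 1),"
        " downvotes = (SELECT COUNT(*) FROM vote WHERE post_id = ? AND value = -1)"
        " WHERE id = ?",
        (post_id, post_id, post_id),
    )


def cast_vote(conn: sqlite3.Connection, post_id: int, user_id: int, value: int) -> None:
    """Record or change a user's vote, then refresh the cache in the same transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO vote (post_id, user_id, value) VALUES (?, ?, ?)",
            (post_id, user_id, value),
        )
        refresh_counts(conn, post_id)


def cancel_vote(conn: sqlite3.Connection, post_id: int, user_id: int) -> None:
    with conn:
        conn.execute(
            "DELETE FROM vote WHERE post_id = ? AND user_id = ?", (post_id, user_id)
        )
        refresh_counts(conn, post_id)


def cached_counts(conn: sqlite3.Connection, post_id: int) -> Tuple[int, int]:
    """What a page render would read: no aggregation over the vote table."""
    return conn.execute(
        "SELECT upvotes, downvotes FROM post WHERE id = ?", (post_id,)
    ).fetchone()
```

Page rendering only ever touches the post row, while the vote table keeps enough detail to change, cancel or audit individual votes later.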
Considering the amount of data you'll have in the database for a social voting website, the additional space for an extra int column to store downvotes is going to be negligible, so you'd be crazy not to.
Well, given that SO has +10 for an upvote and -2 for a downvote, and there's recalculation occasionally going on, it would need to store them independently.
Otherwise an answer with 10 upvotes and 5 downvotes, which originally gave you 90 points (10 × 10 - 5 × 2), would recalc to 50 (a net 5 votes × 10) if they weren't stored separately.
I'd keep them separate so that I could review them separately. Socially, upvoting is very different to downvoting, and I'd want to be able to look at them independently if it were me.
Definitely separately for one very simple reason. Tomorrow you'll want to do something extra that needs that information (some sort of report or graph for example). Besides keeping them separately costs you nothing.
Since people can usually vote only once, and (on SO for example) can cancel their votes, you need to know who voted, at what time, what vote, and on which item.
I am certain that the downvotes and upvotes are kept separately, though there could be an aggregate field that keeps the count. SO lets you change a vote later (make the downvote an upvote), and that's why I believe the votes are logged for each user too.

Creating a Notifications type feed in GAE Objectify

I'm working on a notification feed for my mobile app and am looking for some help on an issue.
The app is a Twitter/Facebook like app where users can post statuses and other users can like, comment, or subscribe to them.
One thing I want to have in my app is to have a notifications feed where users can see who liked/comment on their post or subscribed to them.
The first part of this system I have figured out: when a user likes/comments/subscribes, a Notification entity is written to the datastore with details about the event. To show a user's Notifications, all I have to do is query for all Notifications for that user, sorted by date created descending, and we have a nice little feed of actions other users took on a specific user's account.
The issue I have is what to do when someone unlikes a post, unsubscribes or deletes a comment. Currently, if I were to query for that specific notification, it is possible that nothing would return from the datastore because of eventual consistency. We could imagine someone liking, then immediately unliking a post (b/c who hasn't done that? =P). The query to find that Notification might return null and nothing would get deleted when calling ofy().delete().entity(notification).now(); And now the user has a notification in their feed saying Sally liked their post when in reality she liked then quickly unliked it!
A wrench in this whole system is that I cannot delete by Key<Notification>, because I don't really have a way to know the id of the Notification when trying to delete it.
A potential solution I am experimenting with is to not delete any Notifications. Instead I would always write Notifications and simply indicate whether each one was positive or negative. Then in my query to display notifications to a specific user, I could somehow display only the net-positive Notifications. This would save some money on datastore too because deleting entities is expensive.
There are three main ways I've solved this problem before:
Deterministic key
For example:
{user-id}-{post-id}-{liked-by} for likes
{user-id}-{post-id}-{comment-by}-{comment-index} for comments
This will work for most basic use cases for the problem you defined, and it allows get and delete by key, but you'll have some hairy edge cases to figure out (like managing indexes of comments as they get edited and deleted).
Parallel data structures
The idea here is to create more than one entity at a time in a transaction, but to make sure they have related keys. For example, when someone comments on a feed item, create a Comment entity, then create a CommentedOn entity which has the same ID, but make it have a parent key of the commenter user.
Then, you can make a strongly consistent query for the CommentedOn, and use the same id to do a get by key on the Comment. You can also just store a key, rather than having matching IDs if that's too hard. Having matching IDs in practice was easier each time I did this.
The main limitation of this approach is that you're effectively creating an index yourself out of entities, and while this can give you strongly consistent queries where you need them the throughput limitations of transactional writes can become harder to understand. You also need to manage state changes (like deletes) carefully.
State flags on entities
Assuming the Notification object just shows the user that something happened but links to another entity for the actual data, you could store a state flag (deleted, hidden, private etc) on that entity. Then listing your notifications would be a matter of loading the entities server side and filtering in code (or possibly subsequent filtered queries).
At the end of the day, the complexity of the solution should mirror the complexity of the problem. I would start with approach 3, then migrate to approach 2 once the fuller set of requirements is understood. Approach 2 is more robust and flexible, though the limitations of XG (cross-group) transactions will rear their head. Ultimately, a distributed feed like this is a hard problem.
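To make the first approach (the deterministic key) concrete, here is a language-neutral sketch in Python rather than Objectify. The dict stands in for a key-value datastore, all function names are hypothetical, and it assumes the IDs themselves never contain the "-" separator:

```python
from typing import Dict

# Hypothetical stand-in for a datastore that supports get/put/delete by key.
datastore: Dict[str, dict] = {}


def like_notification_key(owner_id: str, post_id: str, liked_by: str) -> str:
    """Build the {user-id}-{post-id}-{liked-by} key; the same inputs always give the same key."""
    return f"{owner_id}-{post_id}-{liked_by}"


def on_like(owner_id: str, post_id: str, liked_by: str) -> str:
    """Write the notification under its deterministic key (idempotent on repeat likes)."""
    key = like_notification_key(owner_id, post_id, liked_by)
    datastore[key] = {"type": "like", "owner": owner_id, "post": post_id, "actor": liked_by}
    return key


def on_unlike(owner_id: str, post_id: str, liked_by: str) -> bool:
    """Delete by key without querying; returns True if a notification was removed."""
    key = like_notification_key(owner_id, post_id, liked_by)
    return datastore.pop(key, None) is not None
```

Since the unlike path recomputes the key from data it already has, it never depends on an eventually consistent query to find the notification.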
What I ended up doing, and what worked for my specific model, was to first allocate an ID for the Notification before creating it:
// Allocate an ID for a Notification
final Key<Notification> notificationKey = factory().allocateId(Notification.class);
final Long notificationId = notificationKey.getId();
Then when creating my Like or Follow Entity, I would set the property Like.notificationId = notificationId; or Follow.notificationId = notificationId;
Then I would save both Entities.
Later, when I want to delete the Like or Follow, I can read its notificationId, load the Notification by key (a get by key is strongly consistent), and delete it too.
Just another approach that may help someone =D
