App Engine: how would you... snapshotting entities - google-app-engine

Let's say you have two kinds, Message and Contact, related by a
db.ListProperty of keys on Message. A user creates a message, adds
some contacts as recipients, and emails the message. Later, the user
deletes one of the contact entities that was a recipient of the
message. Our application should delete the appropriate Contact
entity, but we want to preserve the original recipient list for the
message that was sent for the user's records. In essence, we want a
snapshot of the message entity at the time it was sent. If we naively
remove the deleted contact's key from the recipients list, though, we lose
snapshot integrity; if we leave it, we are left with an invalid key.
How would you handle this situation,
either in controller logic or model changes?
class User(db.Model):
    email = db.EmailProperty(required=True)

class Contact(db.Model):
    email = db.EmailProperty(required=True)
    user = db.ReferenceProperty(User, collection_name='contacts')

class Message(db.Model):
    recipients = db.ListProperty(db.Key)  # contacts
    sender = db.ReferenceProperty(User, collection_name='messages')
    body = db.TextProperty()
    is_emailed = db.BooleanProperty(default=False)

I would add a boolean field "deleted" (or something spiffier, such as the date and time of deletion) to the Contact model -- so that contacts are never physically deleted, but rather only "logically" deleted when that field is set. (This also lets you offer other cool features such as "show my old now-deleted contacts", "undelete" functionality, etc, if you wish).
This is a common approach in all storage systems that are required to maintain historical integrity (and/or similar requirements such as "auditability").
In cases where the sheer number of logically deleted entities threatens system performance, the classic alternative is a separate, identical model such as "DeletedContacts"; but foreign-key-style constraints then require more work, e.g. the Message class would need both recipients and deleted_recipients fields if you wanted referential integrity (though since you are storing plain keys, that extra work would not be needed).
I doubt the average user will delete such a huge percentage of their contacts as to warrant the optimization explained in the last paragraph, so in this case I'd go with the simple "deleted" field.
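A minimal sketch of that approach against the question's models (the deleted_at name and the helper functions are illustrative, not part of the original design):

import datetime
from google.appengine.ext import db

class Contact(db.Model):
    email = db.EmailProperty(required=True)
    user = db.ReferenceProperty(User, collection_name='contacts')
    # Logical-deletion marker: None means the contact is still active.
    deleted_at = db.DateTimeProperty()

def logically_delete_contact(contact):
    # Instead of contact.delete(), just stamp the deletion time.
    contact.deleted_at = datetime.datetime.utcnow()
    contact.put()

def active_contacts(user):
    # Active contacts are those that were never logically deleted.
    return Contact.all().filter('user =', user).filter('deleted_at =', None)

Message.recipients keeps its keys untouched, so the original recipient list (including logically deleted contacts) stays intact for the user's records.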

Alternatively, you could refactor your Contact model by moving the email address into the key name and setting the user as the parent entity. Your recipients property would change to a string list of raw email addresses. This gives you a static list of email recipients without having to fetch a set of corresponding entities for each one, or requiring that such entities still exist. If you want to fetch the contact entities, you can easily construct their keys from the user and the recipient address.
One limitation here is that the email address on an existing contact entity cannot be changed, but I think you have that problem anyway. Changing a contact address with your existing model would retroactively change the recipients of a sent message, which we know is a problem.
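A rough sketch of that refactoring, again in terms of the question's db models (the helper names are illustrative):

from google.appengine.ext import db

class Contact(db.Model):
    # The e-mail address is the key_name and the owning User is the parent,
    # so no separate email/user properties are strictly required.
    pass

class Message(db.Model):
    recipients = db.StringListProperty()  # raw recipient e-mail addresses
    sender = db.ReferenceProperty(User, collection_name='messages')
    body = db.TextProperty()
    is_emailed = db.BooleanProperty(default=False)

def create_contact(user, email):
    return Contact(key_name=email, parent=user)

def contact_key(user, email):
    # Rebuild a Contact key from the user and address alone; no query needed,
    # and it does not matter whether the Contact entity still exists.
    return db.Key.from_path('Contact', email, parent=user.key())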

Related

Change appengine ndb key

I have a game where I've (foolishly) made the db key equal to the user's login email. I did this several years ago, so I've got quite a few users now. Some users have asked to change their email login for my game. Is there a simple way to change the key? As far as I can tell, I'd need to make a new entry with the new email, copy all the data across, then delete the old db entry. That covers the user model, but I've also got other models, like one for each game they are involved in, that store the user key, so I'd have to loop through all of them as well and swap in the new key.
Before I embark on this I wanted to see if anyone else had a better plan. There could be several models storing that old user key so I'm also worried about the process timing out.
Keying on the email does keep it efficient to pull a db entry, since I know the key from their email without doing a search, but it's pretty inflexible in hindsight.
This is actually why you don't use a user's email as their key. Use ndb's default randomly generated key ids.
The efficiency you're referring to is not having to query by email to retrieve the user id. But that only happens once, at user login, or from your admin screens when looking at someone's account.
You should rip the band-aid off now and do a schema migration away from this model.
Create a new user model (e.g. UsersV2) and clone your existing user entities into it so that new ids are generated.
On every model that references it, add a duplicate field user_v2 = ndb.KeyProperty(UsersV2) and populate it with the new key.
Delete the legacy user model.
You should use the taskqueue to do something like this and then you won't have to worry about the process timing out:
https://cloud.google.com/appengine/articles/update_schema
Alternatively, if you are determined to do this cascading update every time a user changes an email, you could set up a similar update_schema task for just that user.
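A rough sketch of the clone step, along the lines of the schema-update article above (UsersV2 follows the naming in the steps; KeyMapping, the batch size, and the assumption that both models share property names are mine):

from google.appengine.ext import deferred, ndb

BATCH_SIZE = 100

class KeyMapping(ndb.Model):
    # Keyed by the legacy e-mail id; stores the corresponding UsersV2 key so
    # referencing models can later have their user_v2 property populated.
    new_key = ndb.KeyProperty(kind='UsersV2')

def migrate_users(cursor=None):
    # Copy one batch of legacy User entities into UsersV2 with auto-generated ids.
    users, next_cursor, more = User.query().fetch_page(
        BATCH_SIZE, start_cursor=cursor)
    mappings = []
    for old in users:
        new_key = UsersV2(**old.to_dict()).put()  # assumes identical properties
        mappings.append(KeyMapping(id=old.key.id(), new_key=new_key))
    ndb.put_multi(mappings)
    if more:
        # Chain the next batch as a new task so no single request times out.
        deferred.defer(migrate_users, next_cursor)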
I ended up adding a new property to my user model and running a crawler to copy the string key (the email) into that new property. I changed my code to search on that property rather than the key string to get a user item. Most of my users still have keys that equal their email, but I can safely ignore that as if the key string were meaningless. I can now change a user's email easily without making a new record, and my other models that have pointers to these user keys can remain unchanged.

Sharded ancestor entities in GAE

I'm working on a GAE-based project involving a large user base (possibly millions of users). We use Datastore for persistency. Users will be identified both by username and by e-mail address, so these two properties should be unique across all entities of the kind. Because Datastore doesn't support unique fields other than ID, we need transactions to ensure uniqueness of these fields when new users are registered. And in order to have transactions, User entities need to be enclosed in entity groups.
Having large entity groups is not recommended, as pointed out here. Therefore, given a possible large number of stored users, I'm thinking of putting them into multiple smaller entity groups. Each group would have a common parent with ID generated from the two unique fields (a piece of the MD5 sum for instance). Inserting a new user could look like this (in Python):
@ndb.transactional
def register_new_user(login, email, full_name):
    # validation code omitted
    group_id = a_simple_hash(login, email)
    group_key = ndb.Key('UserGroup', group_id)
    # The new user is parented to the group so the ancestor query below sees it.
    user = User(parent=group_key, login=login, email=email, full_name=full_name)
    query = User.query(ancestor=group_key).filter(
        ndb.OR(User.login == login, User.email == email))
    if not query.get():
        user.put()
One problem I see with this solution is that it will be impossible to get a User by ID alone. We'd have to use complete entity keys.
Are there any other cons of such approach? Anyone tried something similar?
EDIT
As has been pointed out in the comments, a hash like the one outlined above would not work properly, because it would only prevent registering users having non-unique e-mails together with non-unique usernames matching those e-mails. It would work if the hash were computed from a single field.
Nevertheless, I find the concept of such sharding interesting by itself and perhaps worthy of discussion.
An e-mail address is owned by one user and is unique, so there is only a very small chance that somebody will (try to) use the same e-mail address.
So my approach would be: get_or_insert a new login, which makes it easy to log in (by key), and then verify that the e-mail address is unique.
If it is not unique, you can discard it or... do something else.
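A minimal sketch of that idea, written as an explicit transaction rather than get_or_insert so the "login already taken" case is easy to detect (model and property names are assumptions):

from google.appengine.ext import ndb

class User(ndb.Model):
    # The login is the key id, so its uniqueness is enforced by the key itself.
    email = ndb.StringProperty(required=True)
    full_name = ndb.StringProperty()

@ndb.transactional
def register(login, email, full_name):
    if User.get_by_id(login) is not None:
        raise ValueError('login %r is already taken' % login)
    user = User(id=login, email=email, full_name=full_name)
    user.put()
    return user

def email_in_use(email):
    # Plain, eventually consistent query; per the discussion below this is
    # usually good enough for spotting a reused e-mail address.
    return User.query(User.email == email).get(keys_only=True) is not None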
Entity groups have meaning for transactions. I'm interested in your planned transactions, because I do not understand your entity group key hash. Which entities will be part of the entity group, and why?
A user with the same login could end up in another entity group, if I understand your hash correctly?
It looks like your entity group holds a single entity.
In my opinion you're overthinking here: what's the probability of having two users register with the same username at the same time?
Very slim. Eventual consistency is good enough for this case, as you don't need nanosecond precision...
unless you plan to have more users than facebook, with people registering every second.
Registering with the same email is virtually impossible for different users, since the check has already been done by the email provider for you!
Only a user could try to open two accounts with the same email address. Eventual consistency is good enough for this query too.
Your user entities each belong to their own entity group.
Actually in most use cases, your User is the most obvious root entity: people use the datastore because they need scalability, and most of the time huge scale is needed for user-oriented apps.

Looking for Denormalization Advice for Google App Engine

I am working on a system that will run on GAE and will have several related entities, and I am not sure of the best way to store the data. This post is a request for advice from others who may have similar experience...
The system will have users, with profile data and an image. Those users will be able to create "events" and add journal entries to them. For the purpose of the system, the "events" will likely have 1 or 2 journal entries in them, and anything over 10 would likely never happen. Other users will be able to add comments to users' entries as well, where popular ones may have hundreds or even thousands of comments.

When a random visitor uses the system, they should be able to see the latest events (latest being defined by those with the latest journal entries in them), search by tag, and perform a very basic text search. Upon selecting an event to view, it should be displayed with all journal entries and all user comments, with user images alongside comments. A user should also have a kind of self-admin page, to view/modify/delete their events and to view/modify/delete comments they have made on other events.

Doing all this on a normal RDBMS would just be queries with some big joins across several tables. On GAE it would obviously need to work differently. Here are my initial thoughts on the design of the entities:
Event entity - id, name, timestamp, list property of tags, view count, creator's username, creator's profile image id, number of journal entries it contains, number of total comments it contains, timestamp of last update to contained journal entries, list property of index words for search (built/updated from the text of contained journal entries)
JournalEntry entity - timestamp, journal text, name of event, creator's username, creator's profile image id, list property of comments (containing commenter username and image id)
User entity - username, password hash, email, list property of subscribed events, timestamp of create date, image id, number of comments posted, number of events created, number of journal entries created, timestamp of last journal activity
UserComment entity - username, id of event commented on, title of event commented on
TagData entity - tag name, count of events with tag on them
So, I'd like to hear what people here think about the design and what changes should be made to help it scale well. Thanks!
Rather than store Event.id as a property, use the id automatically embedded in each entity's key, or set unique key names on entities as you create them.
You have lots of options for modeling the relationship between Event and JournalEntry: you could use a ReferenceProperty, you could parent JournalEntries to Events and retrieve them with ancestor queries, or you could store a list of JournalEntry key ids or names on Event and retrieve them in batch with a key query. Try some things out with realistically-distributed dummy data, and use appstats to see what works best.
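For instance, the parent/ancestor-query option might look roughly like this (property names are trimmed placeholders, not the full Event design above):

from google.appengine.ext import db

class Event(db.Model):
    name = db.StringProperty()
    tags = db.StringListProperty()

class JournalEntry(db.Model):
    # Parented to its Event, so entries are fetched with an ancestor query.
    text = db.TextProperty()
    created = db.DateTimeProperty(auto_now_add=True)

def add_entry(event, text):
    return JournalEntry(parent=event, text=text).put()

def entries_for(event):
    # Ancestor queries are strongly consistent within the entity group.
    return JournalEntry.all().ancestor(event).order('created')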
UserComment references an Event, while JournalEntry references a list of UserComments, which is a little confusing. Is there a relationship between UserComment and JournalEntry? or just between UserComment and Event?
Persisting so many counts is expensive. When I post a comment, you're going to write a new UserComment entity and also update my User entity and a JournalEntry entity and an Event entity. The number of UserComments you expect per Event makes it unwise to include everything in the same entity group, which means you can't do these writes transactionally, so you'll do them serially, and the entities might be stored across different network nodes, making the whole operation slow; and you'll also be open to consistency problems. Can you do without some of these counts and consider storing others in memcache?
When you fetch an Event from the datastore, you don't actually care about its list of search index words, and retrieving and deserializing them from protocol buffers has a cost. You can get around this by splitting each Event's search index words into a separate child EventIndex entity. Then you can query EventIndex on your search term, fetch just the EventIndex keys for EventIndexes that match your search, derive the corresponding Events' keys with key.parent(), and fetch the Events by key, never paying for the retrieval or deserialization of your search index word lists. Brett Slatkin explains this strategy here at 14:35.
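A rough sketch of that split (EventIndex here is a child of its Event; the names and fetch limit are illustrative):

from google.appengine.ext import db

class EventIndex(db.Model):
    # Child entity of its Event; it holds only the search words, so the
    # Event itself never has to carry (or deserialize) the word list.
    index_words = db.StringListProperty()

def index_event(event, words):
    EventIndex(parent=event, index_words=sorted(set(words))).put()

def search_events(term, limit=20):
    # Keys-only query: we match on the word list but never load it.
    index_keys = EventIndex.all(keys_only=True).filter(
        'index_words =', term).fetch(limit)
    return db.get([k.parent() for k in index_keys])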
Updating Event.viewCount will fail if you have a lot of views for any Event in rapid succession, so you should try out counter sharding.
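The view counter could be sharded roughly like this (a condensed version of the standard sharded-counter recipe; the shard count and kind name are illustrative):

import random
from google.appengine.ext import db

NUM_SHARDS = 20

class EventViewShard(db.Model):
    event_key = db.StringProperty(required=True)  # str(event.key())
    count = db.IntegerProperty(default=0)

def increment_views(event):
    shard_name = '%s-%d' % (event.key(), random.randint(0, NUM_SHARDS - 1))
    def txn():
        shard = EventViewShard.get_by_key_name(shard_name)
        if shard is None:
            shard = EventViewShard(key_name=shard_name,
                                   event_key=str(event.key()))
        shard.count += 1
        shard.put()
    # Each shard is its own entity group, so write contention is spread out.
    db.run_in_transaction(txn)

def view_count(event):
    shards = EventViewShard.all().filter('event_key =', str(event.key()))
    return sum(s.count for s in shards)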
Good luck, and tell us what you learn by trying stuff out.

database design for notification settings

A user can turn on or off notification settings for his account, for notifications such as Changed Account Profile Information, Received New Message, etc.
Notification can be sent via email or mobile phone (either push or sms), user can have 1 email only and many mobile phone devices.
Is there any way you would improve the following database design or would you do it differently?
Let me know, thanks.
USER_NOTIFICATION_SETTING
Id
UserId
Notification_SettingCode
NotificationTypeCode
UserDeviceId -- the mobile deviceid
IsEnabled -- true (notification is on), false (notification is off)
NOTIFICATION_SETTING
Code - e.g 1001, 1002
Name -- e.g Changed Account Profile Information, Received New Message etc
NOTIFICATION_TYPE
Code - e.g 1001, 1002
Name -- e.g Email, SMS, Push
USER_DEVICE -- the mobile phone device information
etc...etc...
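Spelled out as DDL, that first design might look roughly like this (column types and the trimmed-down USER_DEVICE are assumptions; SQLite syntax is just for illustration):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE NOTIFICATION_SETTING (
    Code INTEGER PRIMARY KEY,   -- e.g. 1001, 1002
    Name TEXT                   -- e.g. 'Changed Account Profile Information'
);
CREATE TABLE NOTIFICATION_TYPE (
    Code INTEGER PRIMARY KEY,   -- e.g. 1001, 1002
    Name TEXT                   -- e.g. 'Email', 'SMS', 'Push'
);
CREATE TABLE USER_DEVICE (
    Id     INTEGER PRIMARY KEY,
    UserId INTEGER              -- plus the other device columns
);
CREATE TABLE USER_NOTIFICATION_SETTING (
    Id                       INTEGER PRIMARY KEY,
    UserId                   INTEGER,
    Notification_SettingCode INTEGER REFERENCES NOTIFICATION_SETTING(Code),
    NotificationTypeCode     INTEGER REFERENCES NOTIFICATION_TYPE(Code),
    UserDeviceId             INTEGER REFERENCES USER_DEVICE(Id),
    IsEnabled                INTEGER  -- 1 = notification on, 0 = off
);
""")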
Or maybe this one, which propagates natural keys. This has wider tables but requires fewer joins. For example, you can get notifications for a UserName directly from the NotificationQueue.
Or this one, which is good enough if you have phone and email only. So far the simplest -- I think that currently I like this one the best.
What you've done looks pretty good actually. I would out of personal preference do the following:
Eliminate the UserId column on User_Notification_Setting as it should already be on your User_Device table
Get rid of the _s in your table names
Change the Code fields in Notification_Setting and Notification_Type to be Id (even if they are not Identity columns) and then change the foreign key references from other tables to have a more consistent NotificationTypeId field name.
Eliminate the IsEnabled field. The fact that a record exists at the intersection should suffice for having the notification. Deletion of that record means that there is no notification. I can see why you might want to remember that a notification was there at one time and maybe have it there to easily re-enable but I see no information stored at the intersection so deletion is just as good.
Looks good, only a few minor suggestions:
Naming of code fields: use the table name followed by _Code
Add a notification for all changes
There are a couple of things I do not agree with Tahbaza on:
I would leave the user id in, it is then faster to get all notifications for a user
I would leave the isEnabled in, it is then possible to temporarily stop all notifications

Any simple approaches for managing customer data change requests for global reference files?

For the first time, I am developing in an environment in which there is a central repository for a number of different industry standard reference data tables and many different customers who need to select records from these industry standard reference data tables to fill in foreign key information for their customer specific records.
Because these industry standard reference files are utilized by all customers, I want to reserve Create/Update/Delete access to these records for global product administrators. However, I would like to implement a (semi-)automated interface by which specific customers could request record additions, deletions or modifications to any of the industry standard reference files that are shared among all customers.
I know I need something like a "data change request" table specifying:
user id,
user request datetime,
request type (insert, modify, delete),
a user entered text explanation of the change request,
the user request's current status (pending, declined, completed),
admin resolution datetime,
admin id,
an admin entered text description of the resolution,
etc.
What I can't figure out is how to elegantly handle the fact that these data change requests could apply to dozens of different tables with differing table column definitions. I would like to give the customer users making these data change requests a convenient way to enter their proposed record additions/modifications directly into CRUD screens that look very much like the reference table CRUD screens they don't have write/delete permissions for (with an additional text explanation and perhaps a request priority field).

I would also like to give the global admins a tool that allows them to view all the outstanding data change requests for the users they oversee, sorted by date requested or by user/date requested. Upon selecting a data change request record off the list, the admin would be directed to another CRUD screen populated with the fields the customer users requested for the new/modified industry standard reference table record, along with the customer's text explanation, the request status, and the text resolution explanation field. At this point the admin could accept/edit/reject the requested change and, if accepted, the affected industry standard reference file would automatically be updated with the appropriate fields, and the data change request record's status, text resolution explanation, and resolution datetime would all also be appropriately updated.
However, I want to keep the actual production reference tables as simple as possible and free from these extraneous and typically null customer change request fields. I'd also like the data change request file to aggregate all data change requests across all the reference tables yet somehow "point to" the specific reference table and primary key in question for modification & deletion requests or the specific reference table and associated customer user entered field values in question for record creation requests.
Does anybody have any ideas of how to design something like this effectively? Is there a cleaner, simpler way I am missing?
Option 1
If preserving the base tables is important then I would create a "change details" table as a child to your change request table. I'm envisioning something like
ChangeID
TableName
TableKeyValue
FieldName
ProposedValue
Add/Change/Delete Indicator
So you'd have a row in this table for every proposed field change. The challenge in this scenario is maintaining the mapping of TableName and FieldName values to the actual tables and fields. If your database structure is fairly static then this may not be an issue.
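For example, a hypothetical sketch of how the approved "change" rows could be applied (the ChangeRequest/ChangeDetail columns follow the field lists above; the generic "ID" primary-key column and the whitelisting caveat are my assumptions):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE ChangeRequest (
    ChangeID INTEGER PRIMARY KEY,
    Status   TEXT                 -- pending / declined / completed
);
CREATE TABLE ChangeDetail (
    ChangeID      INTEGER REFERENCES ChangeRequest(ChangeID),
    TableName     TEXT,           -- which reference table the change targets
    TableKeyValue TEXT,           -- primary key of the targeted row
    FieldName     TEXT,
    ProposedValue TEXT,
    ChangeType    TEXT            -- Add / Change / Delete indicator
);
""")

def apply_modifications(conn, change_id):
    # Turn approved 'Change' rows into UPDATEs against the named table.
    # TableName and FieldName must be validated against a whitelist of known
    # tables/columns before being interpolated into SQL.
    rows = conn.execute(
        "SELECT TableName, TableKeyValue, FieldName, ProposedValue "
        "FROM ChangeDetail WHERE ChangeID = ? AND ChangeType = 'Change'",
        (change_id,)).fetchall()
    for table, key_value, field, value in rows:
        conn.execute("UPDATE %s SET %s = ? WHERE ID = ?" % (table, field),
                     (value, key_value))
    conn.execute("UPDATE ChangeRequest SET Status = 'completed' "
                 "WHERE ChangeID = ?", (change_id,))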
Option 2
Add a ChangeID field to each of your base tables. When a change is proposed add a record to the base table with the ChangeID populated. So as an example if you have a Company table, for a single company you could have multiple records:
CompanyCode  ChangeID  CompanyName  CompanyAddress
-----------  --------  -----------  --------------
COMP1                  My Company   Boston          <-- The "live" record
COMP1        1         New Name     Boston          <-- A proposed change
When the admin commits the change the existing live record is deleted or archived and the ChangeID value is removed from the proposed record making it the live record. It may be a little tricky to handle proposed deletions with this option. This option also has the potential for impacting performance of selecting live data for normal usage. However it does save you the hassle of maintaining a list of table names and field names somewhere in your code.
I'm sure others will have some opinions!
