Achieving Strong Consistency Using get_or_insert - google-app-engine

I have a model like this:
class UserModel(ndb.Model):
    ''' model class which stores all the user information '''
    fname = ndb.StringProperty(required=True)
    lname = ndb.StringProperty(required=True)
    sex = ndb.StringProperty(required=True, choices=['male', 'female'])
    age = ndb.IntegerProperty(required=True)
    dob = ndb.DateTimeProperty(required=True)
    email = ndb.StringProperty(default=None)
    mobile = ndb.StringProperty(required=True)
    city = ndb.StringProperty(required=True)
    state = ndb.StringProperty(required=True)
Since none of the above fields is unique (not even email, because many people may not have email IDs), I am using the following logic to create a string ID (sketched below):
1. Take the first two letters of 'state' and change them to upper case.
2. Take the first two letters of 'city' and change them to upper case.
3. Get the count of all records in the database and increment by one.
4. Append all of them together.
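A minimal sketch of that scheme, assuming the UserModel above (make_user_id is a hypothetical helper, not code from the question):
def make_user_id(state, city):
    # steps 1-2: first two letters of state and city, upper-cased
    prefix = state[:2].upper() + city[:2].upper()
    # step 3: count all records and increment by one
    seq = UserModel.query().count() + 1
    # step 4: append them together
    return prefix + str(seq)
# usage: UserModel.get_or_insert(make_user_id(state, city), fname=..., ...)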
I am using get_or_insert for inserting the entity.
Adding a user will not happen very often, but any kind of clash would be catastrophic: the probability of contention is low, but its impact is very high.
My questions are:
1. Will using get_or_insert guarantee that I will never have duplicate IDs?
2. The get_or_insert documentation says it "Transactionally retrieves an existing entity or creates a new one." How can something perform an operation "transactionally" without using an ancestor query?
PS: For several reasons I can't keep all the user entities in the same entity group.

In order to provide transactionality, get_or_insert uses a Datastore transaction. A query inside a transaction must be an ancestor query; however, transactions can also perform gets and puts by key, which don't require a parent to be set on the entity.
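A rough sketch of what that looks like under the hood, assuming the UserModel kind from the question (a simplification, not the actual library code):
from google.appengine.ext import ndb

@ndb.transactional
def get_or_insert_sketch(key_id, **props):
    # a transactional get by key -- no ancestor query needed
    key = ndb.Key(UserModel, key_id)
    entity = key.get()
    if entity is None:
        # put only if nothing exists yet, inside the same transaction
        entity = UserModel(id=key_id, **props)
        entity.put()
    return entity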
However, as @Greg mentioned, you absolutely do not want to use this scheme for generating user IDs. In particular, doing a count over your whole kind is incredibly slow, will not scale, and is eventually consistent. Because the query is eventually consistent, it may return a count smaller than the actual count whenever the indexes lag behind recent writes (which for a large app will be all the time). This means you could wait several hours before an insert would actually succeed.
If you want to provide a customer ID containing the state and city, I would recommend the following (a sketch follows the list):
Do a put using automatic ids.
Expose to the user a "Customer ID" which is the State + City + ID.
When you want to lookup a customer given their "Customer ID", just do a get for the ID portion.
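A minimal sketch of that approach (function names are illustrative; it assumes the state and city are available when the customer is created):
def create_customer(state, city, **props):
    customer = UserModel(state=state, city=city, **props)
    key = customer.put()  # no key supplied, so the datastore assigns an automatic ID
    # the "Customer ID" shown to the user: State + City + numeric ID
    return state[:2].upper() + city[:2].upper() + str(key.id())

def lookup_customer(customer_id):
    # only the numeric ID portion is needed for the get
    return UserModel.get_by_id(int(customer_id[4:]))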

If you keep that ID scheme (for which you honestly don't really need steps 1 and 2, just 3), there is no reason for it to create duplicate IDs. With get_or_insert, ndb looks for the exact ID you provide and fetches the entity if it exists, or simply creates it if it doesn't, as explained here. So you CANNOT have duplicate IDs (provided this ID is used as the key of your model). If you follow the link provided, it clearly states that:
The get and subsequent (possible) put operations are wrapped in a transaction to ensure atomicity. This means that get_or_insert() will never overwrite an existing entity, and will insert a new entity if and only if no entity with the given kind and name exists.
And the fact that it does this transactionally means it will lock the entity group to make sure you don't have contention. Since you don't seem to have ancestors, I think it will just lock the entity you're updating.

Related

What does this sentence in the motivation for having views in databases mean?

I was wondering what the following sentence (particularly the highlighted parts) about the motivation of having views for databases means:
if we did so, and the
underlying data in the relations instructor, course, or section changes, the stored
query results would then no longer match the result of reexecuting the query on
the relations.
It comes from Database System Concepts (Silberschatz):
4.2 Views
It is not desirable for all users to see the entire logical model.
Security considerations may require that certain data be hidden from
users. Consider a clerk who needs to know an instructor’s ID, name and
department name, but does not have authorization to see the
instructor’s salary amount. This person should see a relation
described in SQL, by:
select ID, name, dept_name
from instructor;
Aside from security concerns, we may wish to create a personalized
collection of relations that is better matched to a certain user’s
intuition than is the logical model. We may want to have a list of all
course sections offered by the Physics department in the Fall 2009
semester, with the building and room number of each section. The
relation that we would create for obtaining such a list is:
select course.course_id, sec_id, building, room_number
from course, section
where course.course_id = section.course_id
and course.dept_name = 'Physics'
and section.semester = 'Fall'
and section.year = '2009';
It is possible to compute and store the results of the above queries
and then make the stored relations available to users. However, if
we did so, and the underlying data in the relations instructor,
course, or section changes, the stored query results would then no
longer match the result of reexecuting the query on the relations.
In general, it is a bad idea to compute and store query results such
as those in the above examples (although there are some exceptions,
which we study later).
Instead, SQL allows a “virtual relation” to be defined by a query, and
the relation conceptually contains the result of the query. The
virtual relation is not precomputed and stored, but instead is
computed by executing the query whenever the virtual relation is
used. Any such relation that is not part of the logical model, but is
made visible to a user as a virtual relation, is called a view. It is
possible to support a large number of views on top of any given set of
actual relations.
It's stating that precomputing and storing a query's result does not guarantee the stored result will remain current when the data in the underlying tables changes (rows added, updated, removed, etc.).
So, if I execute and store the result of
select course.course_id, sec_id, building, room_number
from course, section
where course.course_id = section.course_id
and course.dept_name = 'Physics'
and section.semester = 'Fall'
and section.year = '2009';
and the course identified by course_id is deleted, the stored result is not going to be updated to reflect the new information because it is static and would need to be purged and recomputed regularly.
Using the virtual-relation mechanism instead means that, because the view's query is re-executed whenever the view is used, the result reflects the deletion automatically; nothing has to be refreshed by hand.
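For example, the second query from the passage could be defined as a view roughly like this (the view name is illustrative):
create view physics_fall_2009 as
select course.course_id, sec_id, building, room_number
from course, section
where course.course_id = section.course_id
and course.dept_name = 'Physics'
and section.semester = 'Fall'
and section.year = '2009';

-- the query is re-executed every time the view is used,
-- so the result always reflects the current tables:
select * from physics_fall_2009;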

Sharded ancestor entities in GAE

I'm working on a GAE-based project involving a large user base (possibly millions of users). We use Datastore for persistence. Users will be identified both by username and by e-mail address, so these two properties should be unique across all entities of the kind. Because Datastore doesn't support unique fields other than ID, we need transactions to ensure uniqueness of these fields when new users are registered. And in order to have transactions, User entities need to be enclosed in entity groups.
Having large entity groups is not recommended, as pointed out here. Therefore, given a possible large number of stored users, I'm thinking of putting them into multiple smaller entity groups. Each group would have a common parent with ID generated from the two unique fields (a piece of the MD5 sum for instance). Inserting a new user could look like this (in Python):
@ndb.transactional
def register_new_user(login, email, full_name):
    # validation code omitted
    group_id = a_simple_hash(login, email)
    group_key = ndb.Key('UserGroup', group_id)
    # the new user must share the group's ancestor,
    # or the ancestor query below would never see it
    user = User(parent=group_key, login=login, email=email, full_name=full_name)
    query = User.query(ancestor=group_key).filter(
        ndb.OR(User.login == login, User.email == email))
    if not query.get():
        user.put()
One problem I see with this solution is that it will be impossible to get a User by ID alone. We'd have to use complete entity keys.
Are there any other cons of such approach? Anyone tried something similar?
EDIT
As was pointed out in the comments, a hash like the one outlined above would not work properly: it would only catch a duplicate when both the username and the e-mail match an existing pair, since only then do the two registrations land in the same group. It would work if the hash were computed from a single field.
Nevertheless, I find the concept of such sharding interesting by itself and perhaps worth of discussion.
An e-mail address is owned by a user and is unique, so there is only a very small chance that somebody will (try to) use the same e-mail address.
So my approach would be: get_or_insert a new login, which makes it easy to log in (by key), and then verify that the e-mail address is unique.
If it is not unique you can discard it or do something else.
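A sketch of that flow, with illustrative model names (two kinds whose key IDs are the values that must stay unique):
class UserAccount(ndb.Model):  # key id = the login name
    email = ndb.StringProperty()

class EmailOwner(ndb.Model):  # key id = the e-mail address
    login = ndb.StringProperty()

account = UserAccount.get_or_insert(login, email=email)
if account.email != email:
    pass  # login already taken by someone else -- discard or do something else
owner = EmailOwner.get_or_insert(email, login=login)
if owner.login != login:
    pass  # e-mail already claimed by another login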
Entity groups have meaning for transactions. I'm interested in your planned transactions, because I do not understand your entity-group key hash. Which entities will be part of the entity group, and why?
A user with the same login would end up in another entity group, if I understand your hash correctly?
It looks like each of your entity groups would hold a single entity.
In my opinion you're overthinking here: what's the probability of having two users register with the same username at the same time?
Very slim. Eventual consistency is good enough for this case, as you don't need nanosecond precision...
unless you plan to have more users than Facebook, with people registering every second.
Registering with the same email is virtually impossible for different users, since the check has already been done by the email provider for you!
Only a user could try to open two accounts with the same email address. Eventual consistency is good enough for this query too.
Your user entities each belong to their own entity group.
Actually in most use cases, your User is the most obvious root entity : people use the datastore because they need scalability, and most of the time huge scale is needed for user oriented apps.

How would I achieve this using Google App Engine Datastore?

I am a beginner to Datastore and I am wondering how I should use it to achieve what I want to do.
For example, my app needs to keep track of customers and all their purchases.
Coming from relational database, I can achieve this by creating [Customers] and [Purchases] table.
In Datastore, I can make [Customers] and [Purchases] kinds.
Where I am struggling is the structure of the [Purchases] kind.
If I make [Purchases] as the child of [Customers] kind, would there be one entity in [Customers] and one entity in [Purchases] that share the same key? Does this mean inside of this [Purchases] entity, I would have a property that just keeps increasing for each purchase they make?
Or would I have one [Purchases] entity for each purchase they make, and in each of these entities I would have a property that points to an entity in the [Customers] kind?
How does Datastore perform in these scenarios?
Sounds like you don't fully understand ancestors. Let's go with the non-ancestor version first, which is a legitimate way to go:
class Customer(ndb.Model):
    # customer data fields
    name = ndb.StringProperty()

class Purchase(ndb.Model):
    customer = ndb.KeyProperty(kind=Customer)
    # purchase data fields
    price = ndb.IntegerProperty()
This is the basic way to go. You'll have one entity in the datastore for each customer. You'll have one entity in the datastore for each purchase, with a KeyProperty that points to the customer.
If you have a purchase and need to find the associated customer, it's right there:
purchase_entity.customer.get()
If you have a Customer, you can issue a query to find all the purchases that belong to the customer:
Purchase.query(Purchase.customer == customer_entity.key).fetch()
In this case, whenever you write either a customer or a purchase entity, the GAE datastore will write that entity to any one of the datastore machines running in the cloud that's not busy. You can have really high write throughput this way. However, when you query for all the purchases of a given customer, you just read back whatever is currently in the indexes. If a new purchase was added but the indexes have not been updated yet, you may get stale data (eventual consistency). You're stuck with this behavior unless you use ancestors.
Now as for the ancestor version. The basic concept is essentially the same. You still have a customer entity, and separate entities for each purchase. The purchase is NOT part of the customer entity. However, when you create a purchase using a customer as an ancestor, it (roughly) means that the purchase is stored on the same machine in the datastore that the customer entity was stored on. In this case, your write performance is limited to the performance of that one machine, and is advertised as one write per second. As a benefit though, you can query that machine using an ancestor query and get an up-to-date list of all the purchases of a given customer.
The syntax for using ancestors is a bit different. The customer part is the same. However, when you create purchases, you'd create it as:
purchase1 = Purchase(parent=customer_entity.key)
purchase2 = Purchase(parent=customer_entity.key)
This example creates two separate purchase entities. Each purchase will have a different key, and the customer has its own key as well. However, each purchase key will have the customer_entity's key embedded in it. So you can think of the purchase key being twice as long. However, you don't need to keep a separate KeyProperty() for the customer anymore, since you can find it in the purchases key.
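To make the "twice as long" point concrete, a small sketch (the numeric ID is illustrative):
customer_key = ndb.Key('Customer', 42)
purchase_key = Purchase(parent=customer_key).put()
# purchase_key now looks like Key('Customer', 42, 'Purchase', <auto id>),
# so the customer key can be recovered from it:
assert purchase_key.parent() == customer_key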
class Purchase(ndb.Model):
    # you don't need a KeyProperty for the customer anymore
    # purchase data fields
    price = ndb.IntegerProperty()
To find the customer for a given purchase:
purchase.key.parent().get()
And in order to query for all the purchases of a given customer:
Purchase.query(ancestor=customer_entity.key).fetch()
The actual structure of the entities doesn't change much, mostly the syntax. But the ancestor queries are fully consistent.
The third option that you kinda describe is not recommended. I'm just including it for completeness. It's a bit confusing, and would go something like this:
class Purchase(ndb.Model):
    # purchase data fields
    price = ndb.IntegerProperty()

class Customer(ndb.Model):
    purchases = ndb.StructuredProperty(Purchase, repeated=True)
This is a special case which uses ndb.StructuredProperty. In this case, you will only have a single Customer entity in the datastore. While there's a class for purchases, your purchases won't get stored as separate entities - they'll just be stored as data within the Customer entity.
There may be a couple of reasons to do this. You're only dealing with one entity, so your reads will be fully consistent. You also have reduced write costs when you have to update a bunch of purchases, since you're only writing a single entity. And you can still query on the properties of the Purchase class. However, this was designed for a limited number of repeated objects, not hundreds or thousands. And each entity is limited to a total size of 1 MB, so you'll eventually hit that limit and won't be able to add more purchases.
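For example, you can still filter customers on a sub-property of the embedded purchases (a sketch, assuming the models above):
# finds customers that have at least one purchase with price > 100
big_spenders = Customer.query(Customer.purchases.price > 100).fetch()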
(from your personal tags I assume you are a java guy, using GAE+java)
First, don't use the ancestor relationships - this has a special purpose to define the transaction scope (aka Entity Groups). It comes with several limitations and should not be used for normal relationships between entities.
Second, do use an ORM instead of the low-level API: my personal favourite is objectify. GAE also offers JDO or JPA.
In GAE relations between entities are simply created by storing a reference (a Key) to an entity inside another entity.
In your case there are two possibilities to create a one-to-many relationship between Customer and its Purchases.
@Entity
public class Customer {
    @Id
    public Long customerId; // 'Long' identifiers are autogenerated

    // first option: parent-to-children references
    public List<Key<Purchase>> purchases; // one-to-many parent-to-child
}

@Entity
public class Purchase {
    @Id
    public Long purchaseId;

    // option two: child-to-parent reference
    public Key<Customer> customer;
}
Whether you use option 1 or option 2 (or both) depends on how you plan to access the data. The difference is whether you use get or query. The difference between the two is in cost and speed, get being always faster and cheaper.
Note: references in GAE Datastore are manual, there is no referential integrity: deleting one part of a relationship will produce no warning/error from Datastore. When you remove entities it's up to your code to fix references - use transactions to update two entities consistently (hint: no need to use Entity Groups - to update two entities in a transaction you can use XG transactions, enabled by default in objectify).
I think the best approach in this specific case would be to use a parent structure.
class Customer(ndb.Model):
    pass

class Purchase(ndb.Model):
    pass

customer = Customer()
customer_key = customer.put()
purchase = Purchase(parent=customer_key)
You could then get all purchases of a customer using
purchases = Purchase.query(ancestor=customer_key)
or get the customer who bought the purchase using
customer = purchase.key.parent().get()
It might indeed be a good idea to keep track of the purchase count when you use that value a lot.
You could do that using a _pre_put_hook or _post_put_hook
class Customer(ndb.Model):
    count = ndb.IntegerProperty(default=0)

class Purchase(ndb.Model):
    def _post_put_hook(self, future):
        # TODO check whether this is a new entity.
        customer = self.key.parent().get()
        customer.count += 1
        customer.put()
It would also be good practice to do this in a transaction, so that the count is rolled back when putting the purchase fails, and vice versa.
@ndb.transactional
def save_purchase(purchase):
    purchase.put()

creating a compound or composite key on google app engine

I have two models:
Car(ndb.Model) and Branch(ndb.Model) each with a key method.
@classmethod
def car_key(cls, company_name, car_registration_id):
    if not (company_name.isalnum() and car_registration_id.isalnum()):
        raise ValueError("Company & car_registration_id must be alphanumeric")
    key_name = company_name + "-" + car_registration_id
    return ndb.Key("Car", key_name)
Branch Key:
@classmethod
def branch_key(cls, company_name, branch_name):
    if not (company_name.isalnum() and branch_name.isalnum()):
        raise ValueError("Company & Branch names must be alphanumeric")
    key_name = company_name + "-" + branch_name
    return ndb.Key("Branch", key_name)
However I'm thinking this is a bit ugly and not really how you're supposed to use keys.
(the car registration is unique to a car but sometimes one company may sell a car to another company and also cars move between branches).
Since a company may have many cars or many branches, I suppose I don't want large entity groups, because you can only write to an entity group once per second.
How should I define my keys?
e.g. I'm considering car_key = ndb.Key("Car", car_reg_id, "Company", company_name)
since it's unlikely for a car to have many companies, so the entity group won't be too big.
However I'm not sure what to do about the branch key since many companies may have the same branch name, and many branches may have the same company.
You've rightly identified that ancestor relationships in GAE should not be based on the logical structure of your data.
They need to be based on the transactional behavior of your application. Ancestors make your life difficult. For example, once you use a compound key, you won't be able to fetch that entity by key unless you happen to know all the elements of the key. If you knew the Car id, you wouldn't be able to fetch it without also knowing the other component.
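For example (a sketch; the key values are illustrative):
# with a compound (ancestor) key, a get needs every component:
car = ndb.Key("Company", "acme", "Car", "reg123").get()
# with a plain key, the registration id alone is enough:
car = ndb.Key("Car", "reg123").get()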
Consider what queries you would need to have strong consistency for. If you do happen to need strong consistency when querying all the cars in a given branch, then you should consider using that as an ancestor.
Consider what operations need to be done in a transaction, that's another good reason for using an entity group.
Keep in mind also, you might not need any entity group at all (probably the answer for your situation).
Or, on the flip side, you might need an entity group that doesn't exactly fit any logical conceptual model: the ancestor might be an entity that exists purely because you need an ancestor for a certain transaction.

Database design - system default items and custom user items

This question applies to any database table design where you have system default items and custom user items of the same type (i.e. the user can add his own custom items/settings).
Here is an example with invoicing and payment terms. By default an invoice can have payment terms of DueOnReceipt, NET10, NET15, NET30 (the default for all users!), so you would have two tables, "INVOICE" and "PAYMENT_TERM".
INVOICE
Id
...
PaymentTermId
PAYMENT_TERM (System default)
Id
Name
Now what is the best way to allow a user to store his own custom "PaymentTerms", and why? (i.e. the user can use the system default payment terms OR his own custom payment terms that he created/added)
Option 1) Add UserId to PaymentTerm; set UserId for the user that added the custom item, and set UserId to null for system defaults.
INVOICE
Id
...
PaymentTermId
PaymentTerm
Id
Name
UserId (System Default, UserId=null)
Option 2) Add a flag "IsPaymentTermCustom" to Invoice and create a separate table "PAYMENT_TERM_CUSTOM"
INVOICE
Id
...
PaymentTermId
PaymentTermCustomId
IsPaymentTermCustom (True for custom, otherwise false for system default)
PaymentTerm
Id
Name
PAYMENT_TERM_CUSTOM
Id
Name
UserId
Now check via an SQL query whether the user is using a custom payment term: if IsPaymentTermCustom = True, the user is using a custom payment term; otherwise it is a system default.
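A sketch of resolving the term under Option 2 (table and column names from the question):
select i.Id,
       case when i.IsPaymentTermCustom = 1 then c.Name else p.Name end as payment_term
from INVOICE i
left join PAYMENT_TERM p on p.Id = i.PaymentTermId
left join PAYMENT_TERM_CUSTOM c on c.Id = i.PaymentTermCustomId;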
Option 3) ????
...
As a general rule:
Prefer adding columns to adding tables
Prefer adding rows to adding columns
Generally speaking, the considerations are:
Effects of adding a table
Requires the most changes to the app: You're supporting a new kind of "thing"
Requires more complicated SQL: You'll have to join to it somehow
May require changes to other tables to add a foreign key column referencing the new table
Impacts performance because more I/O is needed to join to and read from the new table
Note that I am not saying "never add tables". Just know the costs.
Effects of adding a column
Can be expensive to add a column if the table is large (the ALTER TABLE ADD COLUMN can take hours to complete, and during this time the table will be locked, effectively bringing your site "down"), but this is a one-time thing
The cost to the project is low: Easy to code/maintain
Usually requires minimal changes to the app - it's a new aspect of a thing, rather than a new thing
Will perform with negligible performance difference. Will not be measurably worse, but may be a lot faster depending on the situation (if having the new column avoids joining or expensive calculations).
Effects of adding rows
Zero: If your data model can handle your new business idea by just adding more rows, that's the best option
(Pedants kindly refrain from making comments such as "there is no such thing as 'zero' impact", or "but there will still be more disk used for more rows" etc - I'm talking about material impact to the DB/project/code)
To answer the question: Option 1 is best (i.e. add a column to the payment option table).
The reasoning is based on the guidelines above and this situation is a good fit for those guidelines.
Further,
I would also store "standard" payment options in the same table, but with a NULL userid; that way you only have to add new payment options when you really have one, rather than for every customer even if they use a standard one.
It also means your invoice table does not need changing, which is a good thing - it means minimal impact to that part of your app.
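A sketch of fetching the terms available to one user under this design (:user_id stands for the current user's id):
select Id, Name
from PaymentTerm
where UserId is null     -- system defaults
   or UserId = :user_id; -- plus this user's custom terms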
It seems to me that there are merely "Payment Terms" and "Users". The decision of what are the "Default" payment terms is a business rule, and therefore would be best represented in the business layer of your application.
Assuming that you would like to have a set of pre-defined "default" payment terms present in your application from the start, these would already be present in the payment terms table. However, I would put a reference table in between USERS and PAYMENT TERMS:
USERS:
user_id
user_name
USER_PAYMENT_TERMS:
user_id
payment_term_id
PAYMENT_TERMS:
payment_term_id
payment_term
Your business layer should offer up to the user (or more likely, the administrator) through a GUI the ability to:
Assign 0 to many payment term options to a particular user (some users may not want one of the defaults to even be available, for example).
Add custom payment terms, which then become available for assignment to one or more users (but which avoids the creation of duplicate payment terms by different users)
Allow the definition of a custom payment term to be assigned to more than one user (say the user's company has a unique payment process which requires all of its users to utilize a payment term other than one of the defaults: create the custom term once, and assign it to all users).
Your application business layer would establish rules governing access to payment terms, which could then be accessed by your user interface.
Your UI would then (again, likely through an administrator function) allow the set up of one or more payment terms in addition to the standards you describe, and then make them available to one or more users through something like a checked list box (for example).
Option 1 is definitely better, for the following reasons:
Correctness
You can implement a database constraint for uniqueness of the payment term name
You can implement a foreign key constraint from Invoice to PaymentTerm
Ease of Use
Conducting queries will be much simpler because you will always join from Invoice to PaymentTerm rather than requiring a more complex join. Most of the time when you select, you will not care whether it is an inbuilt or a custom payment term. The optimizer will have an easier time with a normal join instead of one that depends on another column to decide which table to join (a sketch follows below).
Easier to display a list of PaymentTerms coming from one table
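A sketch of that single join (table and column names from the question):
select i.Id, pt.Name as payment_term
from INVOICE i
join PaymentTerm pt on pt.Id = i.PaymentTermId;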
We use Option 1 in our data model quite a lot.
Part of the problem, as I see it, is that different payment terms lead to different calculations, too. If I were still in the welding supply business, I'd want to add "2% 10 NET 30", which would mean a 2% discount if the payment is made in full within 10 days, otherwise net 30.
Setting that issue aside, I think ownership of the payment terms makes sense. Assume that the table of users (not shown) includes the user "system" as, say, user_id 0.
create table payment_terms (
    payment_term_id integer primary key,
    payment_term_owner_id integer not null references users (user_id),
    payment_term_desc varchar(30) not null unique
);
insert into payment_terms values (1, 0, 'Net 10');
insert into payment_terms values (2, 0, 'Net 15');
...
insert into payment_terms values (5, 1, '2% 10, Net 30');
This keeps foreign keys simple, and it makes it easy to select payment terms at run time for presentation in the user interface.
Be very careful here. You probably want to store the description, not the ID number, with your invoices. (It's unique; you can set a foreign key reference to it.) If you store only the ID number, updating a user's custom description might subtly corrupt all the data that references it.
For example, let's say that the user created a custom payment term number 5, '2% 10, Net 30'. You store the ID number 5 in your table of invoices. Then the user decides that things will be different starting today, and updates that description to '2% 10, Net 20'. Now on all your past invoices, the arithmetic no longer matches the payment terms.
Your auditor will kill you. Twice.
You'll want to prevent ordinary users from deleting rows owned by the system user. There are several ways to do that.
Use a BEFORE DELETE trigger.
Add another table with foreign key references to the rows owned by the system user.
Restrict all access through stored procedures that prevent deleting system rows.
(And flags are almost never the best idea.)
Applying general rules of database design to the problem at hand:
one table for system payment terms
one table for user payment terms
a view combining the two above (a union of system and user terms)
Now you can join invoice on the view of payment terms.
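A sketch of that view (table and column names are illustrative; it assumes the two tables draw their ids from non-colliding ranges):
create view all_payment_terms as
select id, name, null as user_id from system_payment_terms
union all
select id, name, user_id from user_payment_terms;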
Benefits:
No flag columns
No nulls
You separate system defaults from user data
Things become straightforward for the db
