Google Datastore queries and eventual consistency - google-app-engine

I would like to confirm my understanding of eventual consistency in the Google datastore. Suppose that I have an entity defined as follows (using ndb):
class Record(ndb.Model):
name = ndb.StringProperty()
content = ndb.BlobProperty()
I think I understand Scenarios 1, but I have doubts about Scenarios 2 and 3, so some advice would be highly appreciated.
Scenario 1: I insert a new Record with name "Luca" and a given content. Then, I query the datastore:
qry = Record.query(name=="Luca")
for r in qry.iter():
logger.info("I got this content: %r" % r.content)
I understand that, due to eventual consistency, the just-inserted record might not be part of the result set. I know about using ancestor queries in order to over come this if needed.
Scenario 2: I read an existing Record with name "Luca", update the content, and write it back. For instance, assuming I have the key "k" of this record:
r = k.get()
r.content = "new content"
r.put()
Then, I run the same query as in Scenario 1. When I get the results, assume that the record is part of the result set (for instance, because the index already contained the record with name "Luca" and key k). Am I then guaranteed that the field content will have its new value "new content"?
In other words, if I update a record, leaving its key and indexed fields alone, am I guaranteed to read the most recent value?
Scenario 3: I do similarly to Scenario 2, again where k is the key of a record with name "Luca":
r = k.get()
r.content = "new content"
r.put()
but then I run a modified version of the query:
qry = Record.query(name=="Luca")
for k in qry.iter(keys_only=True):
r = k.get()
logger.info("I got this content: %r" % r.content)
In this case, logic tells me I should be getting the latest value of the content, because reading by key guarantees strong consistency. I would appreciate confirmation.

Scenario 1. Yes, your understanding is correct.
Scenario 2. No, same query, so still eventually consistent.
Scenario 3. Yes, your understanding is correct.
Also you can avoid eventual consistency by doing everything in the same transaction, but of course this may not be applicable.

Related

Django Query Optimisation

I am working currently on telecom analytics project and newbie in query optimisation. To show result in browser it takes a full minute while just 45,000 records are to be accessed. Could you please suggest on ways to reduce time for showing results.
I wrote following query to find call-duration of a person of age-group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
for i in range(popn)]
for card in card_list:
dic=Fact_table.objects.filter(card_no=card.aggregate(Sum('duration'))
sigma+=dic['duration__sum']
avgDur=sigma/popn
Above code is within for loop to iterate over age-groups.
Model is as follows:
class Demo(models.Model):
card_no=models.CharField(max_length=20,primary_key=True)
gender=models.IntegerField()
age=models.IntegerField()
age_group=models.IntegerField()
class Fact_table(models.Model):
pri_key=models.BigIntegerField(primary_key=True)
card_no=models.CharField(max_length=20)
duration=models.IntegerField()
time_8bit=models.CharField(max_length=8)
time_of_day=models.IntegerField()
isBusinessHr=models.IntegerField()
Day_of_week=models.IntegerField()
Day=models.IntegerField()
Thanks
Try that:
sigma=0
demo_by_age = Demo.objects.filter(age_group=age);
popn=demo_by_age.count() #One
card_list = demo_by_age.values_list('card_no', flat=True) # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration') #Three
sigma = dic['duration__sum']
avgDur=sigma/popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn seperate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the amount of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
card_no = models.SlugField(max_length=20, unique=True)
...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for you example:
class Fact_table(models.Model):
card = models.ForeignKey(Demo, related_name='facts')
...
The related_name fields allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
print "Age group: %s - Average duration: %s" % group['age_group'], group['duration_avg']
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.

Cost of updating entities in datastore (and, possible to append properties)?

I have a two part question.
Let's say I have a entity with a blob property...
# create entity
Entity(ndb.Model):
blob = ndb.BlobProperty(indexed=False)
e = Entity()
e.blob = 'abcd'
e_key = e.put()
# update entity
e = e_key.get()
e.blob += 'efg'
e.put()
So questions are:
The first time I put() that entity, the cost is 2 Write Ops; how many Ops does it cost to update the entity, as in the above example?
When I added 'efg' to the property, the old property had to be read into memory first, does app engine provide a way to append the old value without reading it first?
There are no partial updates. Every time you overwrite the whole entity. Numbers of indexes will also have an impact on cost. You might like to have a look at https://developers.google.com/appengine/articles/life_of_write for a detailed breakdown of what happens.

ndb retrieving entity key by ID without parent

I want to get an entity key knowing entity ID and an ancestor.
ID is unique within entity group defined by the ancestor.
It seems to me that it's not possible using ndb interface. As I understand datastore it may be caused by the fact that this operation requires full index scan to perform.
The workaround I used is to create a computed property in the model, which will contain the id part of the key. I'm able now to do an ancestor query and get the key
class SomeModel(ndb.Model):
ID = ndb.ComputedProperty( lambda self: self.key.id() )
#classmethod
def id_to_key(cls, identifier, ancestor):
return cls.query(cls.ID == identifier,
ancestor = ancestor.key ).get( keys_only = True)
It seems to work, but are there any better solutions to this problem?
Update
It seems that for datastore the natural solution is to use full paths instead of identifiers. Initially I thought it'd be too burdensome. After reading dragonx answer I redesigned my application. To my suprise everything looks much simpler now. Additional benefits are that my entities will use less space and I won't need additional indexes.
I ran into this problem too. I think you do have the solution.
The better solution would be to stop using IDs to reference entities, and store either the actual key or a full path.
Internally, I use keys instead of IDs.
On my rest API, I used to do http://url/kind/id (where id looked like "123") to fetch an entity. I modified that to provide the complete ancestor path to the entity: http://url/kind/ancestor-ancestor-id (789-456-123), I'd then parse that string, generate a key, and then get by key.
Since you have full information about your ancestor and you know your id, you could directly create your key and get the entity, as follows:
my_key = ndb.Key(Ancestor, ancestor.key.id(), SomeModel, id)
entity = my_key.get()
This way you avoid making a query that costs more than a get operation both in terms of money and speed.
Hope this helps.
I want to make a little addition to dargonx's answer.
In my application on front-end I use string representation of keys:
str(instance.key())
When I need to make some changes with instence even if it is a descendant I use only string representation of its key. For example I have key_str -- argument from request to delete instance':
instance = Kind.get(key_str)
instance.delete()
My solution is using urlsafe to get item without worry about parent id:
pk = ndb.Key(Product, 1234)
usafe = LocationItem.get_by_id(5678, parent=pk).key.urlsafe()
# now can get by urlsafe
item = ndb.Key(urlsafe=usafe)
print item

Django: Avoiding ABA scenario in Database and ORM’s

I have a situation where I need to update votes for a candidate.
Citizens can vote for this candidate, with more than one vote per candidate. i.e. one person can vote 5 votes, while another person votes 2. In this case this candidate should get 7 votes.
Now, I use Django. And here how the pseudo code looks like
votes = candidate.votes
vote += citizen.vote
The problem here, as you can see is a race condition where the candidate’s votes can get overwritten by another citizen’s vote who did a select earlier and set now.
How can avoid this with an ORM like Django?
If this is purely an arithmetic expression then Django has a nice API called F expressions
Updating attributes based on existing fields
Sometimes you'll need to perform a simple arithmetic task on a field, such as incrementing or decrementing the current value. The obvious way to achieve this is to do something like:
>>> product = Product.objects.get(name='Venezuelan Beaver Cheese')
>>> product.number_sold += 1
>>> product.save()
If the old number_sold value retrieved from the database was 10, then the value of 11 will be written back to the database.
This can be optimized slightly by expressing the update relative to the original field value, rather than as an explicit assignment of a new value. Django provides F() expressions as a way of performing this kind of relative update. Using F() expressions, the previous example would be expressed as:
>>> from django.db.models import F
>>> product = Product.objects.get(name='Venezuelan Beaver Cheese')
>>> product.number_sold = F('number_sold') + 1
>>> product.save()
This approach doesn't use the initial value from the database. Instead, it makes the database do the update based on whatever value is current at the time that the save() is executed.
Once the object has been saved, you must reload the object in order to access the actual value that was applied to the updated field:
>>> product = Products.objects.get(pk=product.pk)
>>> print product.number_sold
42
Perhaps the select_for_update QuerySet method is helpful for you.
An excerpt from the docs:
All matched entries will be locked until the end of the transaction block, meaning that other transactions will be prevented from changing or acquiring locks on them.
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
Mind that this is only available in the Django development release (i.e. > 1.3).

How can I verify if record with particular key_name already exists in the datastore?

I need to add products to my datastore. In case if product already exists, then I need to assign another key_name.
Let's say, I already have:
product = MyDBModelProduct(key_name='milk')
now when I add another products (milk again), I need to have them added with key_names milk-1, milk-2, milk-3 and so on.
So, before adding new db record, I need to verify if milk is already there.
I see 3 possible ways to do so:
(1) GqlQuery usage:
proposed_key_name = 'milk'
product = db.GqlQuery("SELECT * FROM MyDBModelProduct WHERE key_name = :1 LIMIT 1", proposed_key_name).fetch(1)
if (len(product) > 0): # can not use this name, should look for milk-2, milk-3...
(2) all usage:
query = MyDBModelProduct.all()
count = query.filter('key_name =', proposed_key_name).count()
(3) get_by_key_name usage:
id_key_name = proposed_key_name
id_count = 1
while MyDBModelProduct.get_by_key_name(id_key_name) is not None:
id_key_name = proposed_key_name + '-' + id_count
id_count += 1
What is the best approach? Which one will work faster?
Options 1 and 2 are identical. Neither will work, because key_name is not a valid filter - instead you need to construct a fully qualified key and query on it - eg, .filter('__key__ =', my_key). You should also do a keys-only query - SELECT __key__ or MyDBModelProduct.all(keys_only=True), since you don't need the actual model entity.
Number 3 is faster than 1 and 2 as they stand, but if you modify the query as described above, 1 or 2 are faster. If you need to do this in bulk, get/get_by_key_name support batch operations, whilst Query does not.
Your proposed usage is rather perplexing, though - the whole point of key names are to use a natural key that has some external meaning, but you're inventing new ones depending on whether the existing one is used already.

Resources