Cost of updating entities in datastore (and is it possible to append to properties)? - google-app-engine

I have a two-part question.
Let's say I have an entity with a blob property...
# create the entity
class Entity(ndb.Model):
    blob = ndb.BlobProperty(indexed=False)

e = Entity()
e.blob = 'abcd'
e_key = e.put()

# update the entity
e = e_key.get()
e.blob += 'efg'
e.put()
So my questions are:
The first time I put() that entity, the cost is 2 write ops; how many ops does it cost to update the entity, as in the example above?
When I appended 'efg' to the property, the old value had to be read into memory first. Does App Engine provide a way to append to the old value without reading it first?

There are no partial updates. Every write overwrites the whole entity. The number of indexes will also have an impact on cost. You might like to have a look at https://developers.google.com/appengine/articles/life_of_write for a detailed breakdown of what happens.
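There is no server-side append; the value must be read, modified, and rewritten. A minimal sketch of that read-modify-write (assuming the Entity model above), wrapped in a transaction so the append is at least atomic:

from google.appengine.ext import ndb

@ndb.transactional
def append_to_blob(key, suffix):
    e = key.get()      # the old value still has to be read
    e.blob += suffix
    e.put()            # rewrites the whole entity

append_to_blob(e_key, 'efg')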

Related

Does usage of a custom key id generate additional GAE datastore write operations?

Let's say I have the following class:
class Data(ndb.Model):
    data = ndb.StringProperty(required=True, indexed=False)
Is the number of write operations equal in the following two cases:
# case 1: auto-generated id
record = Data()
record.data = data_string
record.put_async()

# case 2: custom id is used
record = Data(id=data_string)
record.data = data_string
record.put_async()
Or does the second case require more write operations? Google's "Understanding write costs" article doesn't clarify this.
You don't need to store data_string twice in your second example: if you use a string to create an entity key, you can extract it back from the key.
Because you do not index this property, the write costs would be the same either way, but the stored data volume will be smaller.
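For illustration, a minimal sketch of reading the string back from the key (this assumes the data property is no longer declared required):

record = Data(id=data_string)        # the string becomes the key's id
record.put_async()

# later, recover the string from the key instead of a stored property
recovered = ndb.Key(Data, data_string).get()
original_string = recovered.key.id()   # == data_string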

Google Datastore queries and eventual consistency

I would like to confirm my understanding of eventual consistency in the Google datastore. Suppose that I have an entity defined as follows (using ndb):
class Record(ndb.Model):
    name = ndb.StringProperty()
    content = ndb.BlobProperty()
I think I understand Scenario 1, but I have doubts about Scenarios 2 and 3, so some advice would be highly appreciated.
Scenario 1: I insert a new Record with name "Luca" and a given content. Then, I query the datastore:
qry = Record.query(Record.name == "Luca")
for r in qry.iter():
    logger.info("I got this content: %r" % r.content)
I understand that, due to eventual consistency, the just-inserted record might not be part of the result set. I know about using ancestor queries in order to overcome this if needed.
Scenario 2: I read an existing Record with name "Luca", update the content, and write it back. For instance, assuming I have the key "k" of this record:
r = k.get()
r.content = "new content"
r.put()
Then, I run the same query as in Scenario 1. When I get the results, assume that the record is part of the result set (for instance, because the index already contained the record with name "Luca" and key k). Am I then guaranteed that the field content will have its new value "new content"?
In other words, if I update a record, leaving its key and indexed fields alone, am I guaranteed to read the most recent value?
Scenario 3: I do the same as in Scenario 2, again where k is the key of a record with name "Luca":
r = k.get()
r.content = "new content"
r.put()
but then I run a modified version of the query:
qry = Record.query(Record.name == "Luca")
for k in qry.iter(keys_only=True):
    r = k.get()
    logger.info("I got this content: %r" % r.content)
In this case, logic tells me I should be getting the latest value of the content, because reading by key guarantees strong consistency. I would appreciate confirmation.
Scenario 1. Yes, your understanding is correct.
Scenario 2. No. It is the same kind of (non-ancestor) query, so it is still eventually consistent; the content you read back may be stale.
Scenario 3. Yes, your understanding is correct: getting by key is strongly consistent.
Also, you can avoid eventual consistency by doing everything in the same transaction, but of course this may not be applicable.
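For reference, a minimal sketch of the ancestor-query approach mentioned in Scenario 1 (the parent key here is made up for illustration):

parent = ndb.Key("Account", "luca")   # hypothetical entity-group parent

r = Record(parent=parent, name="Luca", content="hello")
r.put()

# ancestor queries are strongly consistent, so the new Record is visible
qry = Record.query(Record.name == "Luca", ancestor=parent)
for r in qry.iter():
    logger.info("I got this content: %r" % r.content)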

Google App Engine ndb performance on repeated property

Do I pay a penalty on query performance if I choose to query a repeated property? For example:
class User(ndb.Model):
    user_name = ndb.StringProperty()
    login_providers = ndb.KeyProperty(repeated=True)

fbkey = ndb.Key("ProviderId", 1, "ProviderName", "FB")
for entry in User.query(User.login_providers == fbkey):
    # Do something with entry.key
    pass
vs
class User(ndb.Model):
    user_name = ndb.StringProperty()

class UserProvider(ndb.Model):
    user_key = ndb.KeyProperty(kind=User)
    login_provider = ndb.KeyProperty()

for entry in UserProvider.query(
        UserProvider.user_key == auserkey,
        UserProvider.login_provider == fbkey):
    # Do something with entry.user_key
    pass
Based on the documentation from GAE, it seems that the datastore takes care of indexing, and the first, less verbose option would use the index. However, I failed to find any documentation to confirm this.
Edit
The sole purpose of UserProvider in the second example is to create a one-to-many relationship between a user and its login providers. I wanted to understand whether it is worth the trouble of creating a second entity instead of querying on the repeated property. Also, assume that all I need is the key of the User.
No. But you'll raise your write costs, because each entry in the repeated property needs to be indexed, and write costs are based on the number of index entries updated.
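Since only the key is needed, a keys-only query on the repeated property avoids fetching the entities themselves; a small sketch:

# keys-only queries read the index and skip the entity lookup
keys = User.query(User.login_providers == fbkey).fetch(keys_only=True)
for user_key in keys:
    # Do something with user_key
    pass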

Should the size of entities be as small as possible when I count them with the count() method?

I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
    title = db.StringProperty()
    message = db.StringProperty()
    created_on = db.DateTimeProperty()
    created_by = db.ReferenceProperty(User)
    category = db.StringProperty()
And there are 100,000,000 entities of this model.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
    category = db.StringProperty()
    look_message = db.ReferenceProperty(Message)
Does this small model make counting faster?
And does it use less memory?
By the way, I'm planning to count them like this:
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100,000,000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce operation to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat.all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think that you have to create another entity like this.
This entity will just count the number of messages by category.
Just change your category to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the Message class, change the category property to reference the Category class:
category = db.ReferenceProperty(Category)
When you create a new Message object, you have to update the counter: increment it when you create a message and decrement it when you delete one.
The best way to work with counters on GAE is to use sharded counters.
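A rough sketch of that counter update (the helper name is illustrative, not from the original post; a cross-group transaction is used because Message and Category live in different entity groups):

def add_message(category_key, title, message):
    def txn():
        category = db.get(category_key)
        category.totalOfMessages += 1
        category.put()
        Message(title=title, message=message,
                category=category_key).put()
    xg_on = db.create_transaction_options(xg=True)
    db.run_in_transaction_options(xg_on, txn)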
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method, such as a sharded counter, MapReduce, or a materialized view / fork-join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

ndb retrieving entity key by ID without parent

I want to get an entity key knowing the entity ID and an ancestor.
The ID is unique within the entity group defined by the ancestor.
It seems to me that this is not possible using the ndb interface. As I understand the datastore, this may be because the operation would require a full index scan.
The workaround I used is to create a computed property in the model which contains the id part of the key. I can now do an ancestor query and get the key:
class SomeModel(ndb.Model):
    ID = ndb.ComputedProperty(lambda self: self.key.id())

    @classmethod
    def id_to_key(cls, identifier, ancestor):
        return cls.query(cls.ID == identifier,
                         ancestor=ancestor.key).get(keys_only=True)
It seems to work, but are there any better solutions to this problem?
Update
It seems that for the datastore the natural solution is to use full paths instead of identifiers. Initially I thought it'd be too burdensome. After reading dragonx's answer I redesigned my application. To my surprise, everything looks much simpler now. Additional benefits are that my entities will use less space and I won't need additional indexes.
I ran into this problem too. I think you do have the solution.
The better solution would be to stop using IDs to reference entities, and store either the actual key or a full path.
Internally, I use keys instead of IDs.
On my REST API, I used to do http://url/kind/id (where id looked like "123") to fetch an entity. I modified that to provide the complete ancestor path to the entity: http://url/kind/ancestor-ancestor-id (789-456-123). I'd then parse that string, generate a key, and then get by key.
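A hedged sketch of that parsing step (the kind names here are made up):

def key_from_path_string(path_str,
                         kinds=("GrandParent", "Parent", "SomeModel")):
    # "789-456-123" -> ndb.Key("GrandParent", 789, "Parent", 456, "SomeModel", 123)
    flat = []
    for kind, raw_id in zip(kinds, path_str.split("-")):
        flat.extend([kind, int(raw_id)])
    return ndb.Key(*flat)

entity = key_from_path_string("789-456-123").get()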
Since you have full information about your ancestor and you know your id, you could directly create your key and get the entity, as follows:
my_key = ndb.Key(Ancestor, ancestor.key.id(), SomeModel, id)
entity = my_key.get()
This way you avoid making a query, which costs more than a get operation in both money and speed.
Hope this helps.
I want to make a little addition to dragonx's answer.
In my application, on the front end, I use string representations of keys:
str(instance.key())
When I need to make changes to an instance, even if it is a descendant, I use only the string representation of its key. For example, key_str below is an argument from a request to delete an instance:
instance = Kind.get(key_str)
instance.delete()
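Since the question uses ndb, note that the equivalent there is the key's urlsafe string (a small sketch):

key_str = instance.key.urlsafe()          # send this to the front end
instance = ndb.Key(urlsafe=key_str).get()
instance.key.delete()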
My solution is to use urlsafe to get the item without worrying about the parent id:
pk = ndb.Key(Product, 1234)
usafe = LocationItem.get_by_id(5678, parent=pk).key.urlsafe()
# now we can fetch the item by its urlsafe string
item = ndb.Key(urlsafe=usafe).get()
print item
