reduce google datastore read operation fee - google-app-engine

I have a kind XXX_account with 1000 entities, about 3 MB in total. Whenever I receive a request, a query has to be run to find a certain entity in the kind, and the Google fee came to almost 4 USD in just 20 hours.
Is there any way to reduce the Datastore read operations? I am considering storing the 1000 entities in a text file so that I don't have to read the Datastore every time.
Datastore Read Operations: 5.01 million ops (4.96 million billable) at $0.70/million ops = $3.48
My model.py
class MyUser(DatastoreUser):
    pass

class XXXAccount(db.Model):
    user = db.ReferenceProperty(MyUser,
                                collection_name='xxx_accounts')
    id = db.StringProperty(required=True)
    created = db.DateTimeProperty(auto_now_add=True)
    updated = db.DateTimeProperty(auto_now=True)
    name = db.StringProperty(required=True)
    username = db.StringProperty(required=True)
    profile_url = db.StringProperty(required=True)
    aaa = db.StringProperty(required=True)
    bbb = db.StringProperty(required=True)
view.py
@login_required
def updateprofile(request):
    number_form = NumberForm()
    if request.method == "POST" and number_form.validate(request.form):
        acc_num_str = number_form['nb']
        acc_num = int(acc_num_str)
        current_user = request.user
        xxx_account = current_user.xxx_accounts[acc_num]  # Query
        # do something that does not involve Datastore read or write operations
        return......
UPDATE:
Code was posted
OMG, 0.32 USD for just 1000 requests.

You should post your model definition and the code where you query entities.
Common recommendations:
If you want to fetch a certain entity, there is only one right way to do it: get it by its entity key (an id number or a key_name string). The Datastore automatically assigns an id to an entity when it is saved, or you can set a meaningful key_name yourself when you create the entity.
To get an entity's id or key_name, use Model.key().id() or Model.key().name() in DB, or Model.key.id() in NDB.
Then you can get the entity by id or key_name with the Model.get_by_id() or Model.get_by_key_name() methods if you're using the old DB API, or the Key.get() method if you're using the new NDB API. You can pass the id or key_name in the URL - http://example.com/getentity/[id].
Also, use Memcache to cache entities. Caching can dramatically reduce Datastore usage. By the way, NDB uses caching automatically.
p.s. Sorry, I cannot post more than 2 links.
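For illustration, here is a minimal sketch of the key-based get plus Memcache pattern described above, using the XXXAccount model from the question (the cache key prefix and helper name are made up, and this looks up the entity by its numeric Datastore id, not the id StringProperty in the model):
from google.appengine.api import memcache

CACHE_PREFIX = 'xxx_account:'  # hypothetical cache key prefix

def get_xxx_account(entity_id):
    """Fetch one XXXAccount by its Datastore id, checking Memcache first."""
    cache_key = CACHE_PREFIX + str(entity_id)
    account = memcache.get(cache_key)
    if account is None:
        # A key-based get is a single small read instead of a query
        account = XXXAccount.get_by_id(int(entity_id))
        if account is not None:
            memcache.set(cache_key, account, time=3600)  # cache for one hour
    return account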

Related

How to Improve Django Tastypie web server performance

I have a Django web server with a Tastypie API. The performance is extremely slow, and I am not sure where to look.
The problem can be abstracted this way: there are simply 3 tables.
class Table1(models.Model):
    name = models.CharField(max_length=64)

class Table2(models.Model):
    name = models.CharField(max_length=64)
    table1 = models.ForeignKey(Table1)

class Table3(models.Model):
    name = models.CharField(max_length=64)
    table2 = models.ForeignKey(Table2)
Table1 has about 50 records, Table2 about 400 records, and Table3 about 2000 records. MySQL is used.
It has 3 model resources:
class Table1Resource(ModelResource):
    class Meta(object):
        """Define options attached to model."""
        queryset = models.Table1.objects.all()
        resource_name = 'table1'

class Table2Resource(ModelResource):
    class Meta(object):
        """Define options attached to model."""
        queryset = models.Table2.objects.all()
        resource_name = 'table2'

class Table3Resource(ModelResource):
    class Meta(object):
        """Define options attached to model."""
        queryset = models.Table3.objects.all()
        resource_name = 'table3'
The front end uses Ajax to call the 3 web service APIs to retrieve all the data in the database. My machine has a good configuration (16 GB of memory, for example), but it takes about 40 seconds to load all the data. Too slow; something is obviously not right.
I tried some Django data model functions to improve performance:
1) Django querysets. I noticed the API retrieves all related table objects if there is a foreign key, and Table3Resource access is extremely slow. In my case I just want the data in one table; I am not interested in the inner-join result from another table. For example, it uses models.Table3.objects.all().
I tried models.LabSpace.objects.select_related(). No help at all.
2) For such a small amount of data with such low performance, I am not even considering Tastypie's API caching yet. I feel something is obviously wrong.
Basically, I am not sure if it is a Django or a Tastypie issue. Where should I look?
You should specify the ForeignKey field on the Resource. The default is False, I believe, so you just have to do this:
class Table2Resource(ModelResource):
    table1 = fields.ToOneField(Table1Resource)

    class Meta(object):
        """Define options attached to model."""
        queryset = models.Table2.objects.all()
        resource_name = 'table2'

# etc ...
If not, you can try to set it explicitly like so:
table1 = fields.ToOneField(Table1Resource, 'table1', full=False)
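If the slowness comes from Tastypie issuing one extra SQL query per row while dehydrating the foreign key, another option worth trying (a sketch, not from the original answer; it builds on the models module and Table2Resource defined above) is to prefetch the related rows in the resource's queryset with select_related():
from tastypie import fields
from tastypie.resources import ModelResource

# `models` and Table2Resource are the module and resource already defined above.

class Table3Resource(ModelResource):
    table2 = fields.ToOneField(Table2Resource, 'table2')

    class Meta(object):
        # select_related() pulls Table2 in via a SQL join, so dehydrating
        # the table2 field does not issue one extra query per object.
        queryset = models.Table3.objects.select_related('table2')
        resource_name = 'table3'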

ndb Models are not saved in memcache when using MapReduce

I've created two MapReduce pipelines for uploading CSV files to create Categories and Products in bulk. Each product gets tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation, I would think they'd be automatically cached in Memcache when retrieved from the Datastore.
I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.
However, it doesn't seem like the product upload is using Memcache to get the categories. When I check the Memcache viewer in the portal, the hit count is around 180 and the miss count around 60. If I was uploading 3000 products and retrieving the category each time, shouldn't I have around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).
Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:
def product_bulk_import_map(data):
    """Product Bulk Import map function."""
    result = {"status": "CREATED"}
    product_data = data

    try:
        # parse input parameter tuple
        byteoffset, line_data = data

        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data

        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)

        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id=p_id,
                category=category.key,
                product_type=p_type,
                description=p_description,
            )
        product.put()
        result["entity"] = product.to_dict()
    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)

    # return results
    yield (str(product_data), result)
MapReduce intentionally disables memcache for NDB.
See mapreduce/util.py ln 373, _set_ndb_cache_policy() (as of 2015-05-01):
def _set_ndb_cache_policy():
    """Tell NDB to never cache anything in memcache or in-process.

    This ensures that entities fetched from Datastore input_readers via NDB
    will not bloat up the request memory size and Datastore Puts will avoid
    doing calls to memcache. Without this you get soft memory limit exits,
    which hurts overall throughput.
    """
    ndb_ctx = ndb.get_context()
    ndb_ctx.set_cache_policy(lambda key: False)
    ndb_ctx.set_memcache_policy(lambda key: False)
You can force get_by_id() and put() to use memcache, e.g.:
product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)
Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However, I don't know enough to say whether this has other undesired effects:
ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)
As for the docstring about "soft memory limit exits", I don't understand why that would occur if only memcache were enabled (i.e., no in-context cache).
It actually seems like you want memcache to be enabled for puts; otherwise your app ends up reading stale data from NDB's memcache after your mapper has modified the data underneath.
As Slawek Rewaj already mentioned, this is caused by the in-context cache. When retrieving an entity, NDB tries the in-context cache first, then memcache, and finally it retrieves the entity from the Datastore if it is found in neither the in-context cache nor memcache. The in-context cache is just a Python dictionary, and its lifetime and visibility are limited to the current request, but MapReduce makes multiple calls to product_bulk_import_map() within a single request.
You can find more information about the in-context cache here: https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
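As a minimal sketch of the default caching behaviour described above (outside MapReduce, which disables these caches; the Category class below is a stand-in for the model from the question):
from google.appengine.ext import ndb

class Category(ndb.Model):
    # minimal stand-in for the Category model from the question
    name = ndb.StringProperty()

def fetch_twice(category_id):
    """Two gets in the same request: only the first one can reach memcache/Datastore."""
    first = Category.get_by_id(category_id)   # misses the in-context cache, may hit memcache or the Datastore
    second = Category.get_by_id(category_id)  # served from the in-context cache, no memcache lookup
    return first, second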

Google App Engine ndb performance on repeated property

Do I pay a penalty on query performance if I choose to query a repeated property? For example:
class User(ndb.Model):
    user_name = ndb.StringProperty()
    login_providers = ndb.KeyProperty(repeated=True)

fbkey = ndb.Key("ProviderId", 1, "ProviderName", "FB")

for entry in User.query(User.login_providers == fbkey):
    # Do something with entry.key
vs
class User(ndb.Model):
    user_name = ndb.StringProperty()

class UserProvider(ndb.Model):
    user_key = ndb.KeyProperty(kind=User)
    login_provider = ndb.KeyProperty()

for entry in UserProvider.query(
        UserProvider.user_key == auserkey,
        UserProvider.login_provider == fbkey):
    # Do something with entry.user_key
Based on the GAE documentation, it seems that the Datastore takes care of indexing repeated properties, so the first, less verbose option would already be using an index. However, I failed to find any documentation to confirm this.
Edit
The sole purpose of UserProvider in the second example is to create a one-to-many relationship between a user and its login providers. I wanted to understand whether it is worth the trouble of creating a second entity instead of querying on the repeated property. Also, assume that all I need is the key of the User.
No. But you'll raise your write costs, because each entry in the repeated property needs to be indexed, and write costs are based on the number of index entries updated.
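Since the question notes that only the User key is needed, a keys-only query (a sketch, assuming the first, repeated-property model and the fbkey defined above) avoids fetching the full entities at all:
# Returns ndb.Key objects directly instead of full User entities,
# which is billed as a small (cheaper) Datastore operation.
user_keys = User.query(User.login_providers == fbkey).fetch(keys_only=True)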

Appengine NDB: Putting 880 rows, exceeding datastore write ops quota. Why?

I have an application which imports 880 rows into the NDB Datastore using put_async(). Whenever I run this import, it exceeds the daily quota of 50,000 Datastore write ops.
I'm trying to understand why this operation is so expensive and what can be done to stay under quota.
There are 13 columns like so:
stringbool = ['true', 'false']

class BeerMenu(ndb.Model):
    name = ndb.StringProperty()
    brewery = ndb.StringProperty()
    origin = ndb.StringProperty()
    abv = ndb.FloatProperty()
    size = ndb.FloatProperty()
    meas = ndb.StringProperty()
    price = ndb.FloatProperty()
    active = ndb.StringProperty(default="false", choices=stringbool)
    url = ndb.StringProperty()
    bartender = ndb.StringProperty()
    lineno = ndb.IntegerProperty()
    purdate = ndb.DateProperty()
    costper = ndb.FloatProperty()
I've trimmed the indexing back to one:
- kind: BeerMenu
  properties:
  - name: brewery
  - name: name
According to the SDK datastore viewer, each row is 29 write ops, so that would generate 25,520 writes! I'm assuming that the indexes consume the rest of the write ops, but I don't know exactly how many, because App Engine just says I've exceeded the quota.
What are the best strategies for reducing the number of write ops?
All properties except text and blob properties are indexed by default. So even if you deindex the string properties, all the float, int, and date properties are still indexed. You should add indexed=False to the other properties to reduce writes, as sketched below.
Indexes listed in index.yaml are additional (composite) indexes on top of the built-in per-property indexes; they are needed for things like ordered and multi-property queries (e.g., a query with an inequality on a date property combined with a sort or another filter will generate an entry in index.yaml).
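For illustration, a sketch of the model with only the two properties used by the index.yaml entry left indexed (this assumes none of the other properties are ever filtered or sorted on):
from google.appengine.ext import ndb

stringbool = ['true', 'false']

class BeerMenu(ndb.Model):
    # Queried via the composite index in index.yaml, so these stay indexed.
    name = ndb.StringProperty()
    brewery = ndb.StringProperty()
    # Deindexed properties: each put() skips their per-property index writes.
    origin = ndb.StringProperty(indexed=False)
    abv = ndb.FloatProperty(indexed=False)
    size = ndb.FloatProperty(indexed=False)
    meas = ndb.StringProperty(indexed=False)
    price = ndb.FloatProperty(indexed=False)
    active = ndb.StringProperty(default="false", choices=stringbool, indexed=False)
    url = ndb.StringProperty(indexed=False)
    bartender = ndb.StringProperty(indexed=False)
    lineno = ndb.IntegerProperty(indexed=False)
    purdate = ndb.DateProperty(indexed=False)
    costper = ndb.FloatProperty(indexed=False)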
Also check out the "Costs for Datastore Calls" section here for a better idea of how the costs break down:
Paid Apps: Budgeting, Billing, and Buying Resources
Hope this helps too.

Google App Engine Query (not filter) for children of an entity

Are the children of an entity available in a Query?
Given:
class Factory(db.Model):
    """ Parent kind. """
    name = db.StringProperty()

class Product(db.Model):
    """ Child kind; use Product(parent=factory) to make one. """

    @property
    def factory(self):
        return self.parent()

    serial = db.IntegerProperty()
Assume 500 factories have each made 500 products, for a total of 250,000 products. Is there a way to form a resource-efficient query that returns just the 500 products made by one particular factory? The ancestor method is a filter, so using e.g. Product.all().ancestor(factory_1) would require repeated calls to the Datastore.
Although ancestor is described as a "filter", it actually just updates the query to add the ancestor condition. No request is sent to the Datastore until you iterate over the query, so what you have will work fine.
One minor point, though: 500 entities with the same parent can hurt scalability, since writes to members of an entity group are serialized. If you just want to track which factory made a product, use a ReferenceProperty instead:
class Product(db.Model):
    factory = db.ReferenceProperty(Factory, collection_name="products")
You can then get all the products by using:
myFactory.products
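As a usage note (a sketch; the get_by_key_name lookup and the key_name value are hypothetical): the reverse reference is a lazily evaluated db.Query, so nothing is read from the Datastore until you iterate or fetch it.
# Hypothetical lookup; assumes the factory was stored with a known key_name.
my_factory = Factory.get_by_key_name('factory_1')

# A db.Query bound to this factory; no Datastore call has happened yet.
products_query = my_factory.products

# The single Datastore call happens here, returning up to 500 Product entities.
products = products_query.fetch(500)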
