Imagine I add a new user in the datastore. I have to add 200 rows for him (they just contain zeros), but this can take 40 seconds. The real user who has registered on my website has to wait that long before he can proceed. In MySQL it takes fractions of a second. What do you suggest?
Consider this code. It takes 10 seconds on the Google servers, which is still too slow.
def get(self):
    class Movie(ndb.Model):
        title = ndb.StringProperty(required=True)
        rating = ndb.IntegerProperty(required=True)

        @classmethod
        def populate(cls, n):
            for i in range(n):
                o = cls(title='foo', rating=5)
                o.put()

    t1 = datetime.datetime.now()
    Movie.populate(200)
    t2 = datetime.datetime.now()
    self.response.write(t2 - t1)  # ~10 seconds
As noted in the comment: instead of saving entities one by one, build a list of entities and save them in a single batch with a multi-put (ndb.put_multi).
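A minimal sketch of that batched version, assuming the same Movie model as in the question; the only change is building the entities in memory and writing them with ndb.put_multi in one call:

from google.appengine.ext import ndb

class Movie(ndb.Model):
    title = ndb.StringProperty(required=True)
    rating = ndb.IntegerProperty(required=True)

    @classmethod
    def populate(cls, n):
        movies = [cls(title='foo', rating=5) for _ in range(n)]
        ndb.put_multi(movies)  # one batched RPC instead of n sequential put() calls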
I would suggest using a more sensible data model, frankly. There's no reason at all to create a model with 200 fields. Not only will the initial setup take ages, but loading each instance will be expensive, and saving will be exceedingly expensive.
In any case, you almost certainly don't need to instantiate all the fields from the start.
(Also, I must say that even with 200 fields, taking 40 seconds to save seems extremely unlikely. You are probably doing something strange, but without seeing any code it's impossible to tell.)
Related
I need an access-statistics module for App Engine that tracks a few request handlers and collects statistics into bigtable. I have not found any ready-made solution on GitHub, and Google's examples are either oversimplified (memcached front-page counter with cron) or overkill (accurate sharded counter). Most importantly, no App Engine counter solution discussed elsewhere includes the time component (hourly, daily counts) needed for statistics.
Requirements: The system does not need to be 100% accurate and can simply ignore memcache loss (if it is infrequent). This should simplify things considerably. The idea is to just use memcache and accumulate stats in time intervals.
Use case: Users on your system create content (e.g. pages). You want to track approximately how often a user's pages are viewed per hour or day. Some pages are viewed often, some never. You want to query by user and timeframe. Subpages may have fixed IDs (query for the user with the most hits on the homepage). You may want to delete old entries (query for entries of year=xxxx).
class StatisticsDB(ndb.Model):
    # key.id() = something like YYYY-MM-DD-HH_groupId_countableId ... contains date
    # timeframeId = ndb.StringProperty()  # YYYY-MM-DD-HH, needed for cleanup if counter uses ancestors
    countableId = ndb.StringProperty(required=True)  # name of counter within group
    groupId = ndb.StringProperty()  # counter group (allows single DB query with timeframe prefix inequality)
    count = ndb.IntegerProperty()   # count per specified timeframe

    @classmethod
    def increment(cls, groupId, countableId):
        # increment memcache
        # save hourly to DB (see below)
        pass
Note: indexes on groupId and countableId are necessary to avoid two inequalities in a single query (one query fetches all countables of a groupId/userId; the chart/high-count query finds the countableId with the highest counts and derives the groupId/user). Using ancestors in the DB may not support the chart queries.
The problem is how to best save the memcached counter to DB:
cron: This approach is mentioned in example docs (example front-page counter), but uses fixed counter ids that are hardcoded in the cron-handler. As there is no prefix-query for existing memcache keys, determining which counter-ids were created in memcache during the last time interval and need to be saved is probably the bottleneck.
task-queue: if a counter is created schedule a task to collect it and write it to DB. COST: 1 task-queue entry per used counter and one ndb.put per time granularity (e.g. 1 hour) when the queue-handler saves the data. Seems the most promising approach to also capture infrequent events accurately.
infrequently when increment(id) executes: If a new timeframe starts, save the previous one. This needs at least 2 memcache accesses (get date, incr counter) per increment: one for tracking the timeframe and one for the counter. Disadvantage: bursty counters with longer stale periods may lose the cache.
infrequently when increment(id) executes: probabilistic: if random % 100 == 0 then save to DB, but the counter should have uniformly distributed counting events
infrequently when increment(id) executes: if the counter reaches e.g. 100 then save to DB
Has anyone solved this problem? What would be a good way to design this?
What are the weaknesses and strengths of each approach?
Are there alternate approaches that are missing here?
Assumptions: Counting can be slightly inaccurate (cache loss), the counterID space is large, and counterIDs are incremented sparsely (some once per day, some often per day).
Update: 1) I think cron can be used similarly to the task queue. One only has to create the DB model of the counter with memcached=True and run a query in cron for all counters marked that way. COST: 1 put at the 1st increment, a query at cron time, and 1 put to update the counter. Without thinking it through fully, this appears slightly more costly/complex than the task approach.
Discussed elsewhere:
High concurrency non-sharded counters - no count per timeframe
Open Source GAE Fast Counters - no count per timeframe, nice performance comparison to sharded solution, expected losses due to memcache loss reported
Yep, your #2 idea seems to best address your requirements.
To implement it you need a task execution with a specified delay.
I used the deferred library for this purpose, via the deferred.defer()'s countdown argument. I learned in the meantime that the standard taskqueue library has similar support, by specifying the countdown argument for a Task constructor (I have yet to use this approach, though).
So whenever you create a memcache counter also enqueue a delayed execution task (passing in its payload the counter's memcache key) which will:
get the memcache counter value using the key from the task payload
add the value to the corresponding db counter
delete the memcache counter when the db update is successful
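A minimal sketch of that flow using deferred.defer with a countdown; flush_counter and update_db_counter are hypothetical names, and the one-hour delay is just an example:

from google.appengine.api import memcache
from google.appengine.ext import deferred

def flush_counter(memcache_key):                 # hypothetical task body
    value = memcache.get(memcache_key)           # 1. read the memcache counter
    if value:
        update_db_counter(memcache_key, value)   # 2. hypothetical helper: add the value to the db counter
        memcache.delete(memcache_key)            # 3. delete only after the db update succeeded

# Whenever a new memcache counter is created (its key is in memcache_key),
# enqueue the delayed collection task:
deferred.defer(flush_counter, memcache_key, _countdown=3600)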
You'll probably lose the increments from concurrent requests that arrive between the moment the memcache counter is read in the task execution and the moment it is deleted. You could reduce such loss by deleting the memcache counter immediately after reading it, but then you'd risk losing the entire count if the DB update fails for whatever reason, since re-trying the task would no longer find the memcache counter. If neither of these is satisfactory, you could further refine the solution:
The delayed task:
reads the memcache counter value
enqueues another (transactional) task (with no delay) for adding the value to the db counter
deletes the memcache counter
The non-delayed task is now idempotent and can be safely re-tried until successful.
The risk of loss of increments from concurrent requests still exists, but I guess it's smaller.
Update:
Task Queues are preferable to the deferred library; the deferred functionality is available using the optional countdown or eta arguments to taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
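For illustration, a minimal sketch of enqueuing the delayed collection task directly with taskqueue.add; the handler URL, the parameter name and the one-hour countdown are assumptions:

from google.appengine.api import taskqueue

# Hypothetical: run the collection handler roughly one hour from now.
taskqueue.add(url='/tasks/collect_counter',          # assumed worker URL
              params={'counter_key': counter_key},   # memcache key to flush
              countdown=3600)                        # delay in seconds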
Counting things in a distributed system is a hard problem. There's some good info on the problem from the early days of App Engine. I'd start with Sharding counters, which, despite being written in 2008, is still relevant.
Here is the code for the implementation of the task-queue approach with an hourly timeframe. Interestingly, it works without transactions or other mutex magic.
Supporting priorities for memcache would increase accuracy of this solution.
import datetime
import logging

from google.appengine.api import memcache, taskqueue
from google.appengine.ext import ndb

TASK_URL = '/h/statistics/collect/'  # Example: '/h/statistics/collect/{counter-id}"?groupId=" + groupId + "&countableId=" + countableId'
MEMCACHE_PREFIX = "StatisticsDB_"

class StatisticsDB(ndb.Model):
    """
    Memcached counting saved each hour to DB.
    """
    # key.id() = 2016-01-31-17_groupId_countableId
    countableId = ndb.StringProperty(required=True)  # unique name of counter within group
    groupId = ndb.StringProperty()                   # counter group (allows single DB query for group of counters)
    count = ndb.IntegerProperty(default=0)           # count per timeframe

    @classmethod
    def increment(cls, groupId, countableId):  # throws InvalidTaskNameError
        """
        Increment a counter. countableId is the unique id of the countable.
        Throws InvalidTaskNameError if ids do not match: [a-zA-Z0-9-_]{1,500}
        """
        # Calculate memcache key and db key at this time.
        # The counting timeframe is 1h, determined by %H, MUST MATCH ETA calculation in _add_task()
        counter_key = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H") + "_" + groupId + "_" + countableId
        client = memcache.Client()
        n = client.incr(MEMCACHE_PREFIX + counter_key)
        if n is None:
            cls._add_task(counter_key, groupId, countableId)
            client.incr(MEMCACHE_PREFIX + counter_key, initial_value=0)

    @classmethod
    def _add_task(cls, counter_key, groupId, countableId):
        taskurl = TASK_URL + counter_key + "?groupId=" + groupId + "&countableId=" + countableId
        now = datetime.datetime.now()
        # The counting timeframe is 1h, determined by counter_key, MUST MATCH ETA calculation
        eta = now + datetime.timedelta(minutes=(61 - now.minute))  # at most 1h later, randomized over 1 minute, throttled by queue parameters
        task = taskqueue.Task(url=taskurl, method='GET', name=MEMCACHE_PREFIX + counter_key, eta=eta)
        queue = taskqueue.Queue(name='StatisticsDB')
        try:
            queue.add(task)
        except taskqueue.TaskAlreadyExistsError:  # may also occur if 2 increments are done simultaneously
            logging.warning("StatisticsDB TaskAlreadyExistsError lost memcache for %s", counter_key)
        except taskqueue.TombstonedTaskError:  # task name is locked for a while
            logging.warning("StatisticsDB TombstonedTaskError someone ran this task prematurely for %s", counter_key)

    @classmethod
    def save2db_task_handler(cls, counter_key, countableId, groupId):
        """
        Save counter from memcache to DB. Idempotent method.
        At the time this executes no more increments to this counter occur.
        """
        dbkey = ndb.Key(StatisticsDB, counter_key)
        n = memcache.get(MEMCACHE_PREFIX + counter_key)
        if n is None:
            logging.warning("StatisticsDB lost count for %s", counter_key)
            return
        stats = StatisticsDB(key=dbkey, count=n, countableId=countableId, groupId=groupId)
        stats.put()
        memcache.delete(MEMCACHE_PREFIX + counter_key)  # delete only if put succeeded
        logging.info("StatisticsDB saved %s n = %i", counter_key, n)
I am looking for some help as to the best way to structure data in app engine ndb using python, process it and query it later. I want to store temperature data at hourly intervals for different geographical regions.
I can think of two entity options but there maybe something much better. The first would be to store the hourly temperature in individual properties:
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    temp_00 = ndb.FloatProperty()  # temperature at 00:00
    temp_01 = ndb.FloatProperty()  # temperature at 01:00
    ...
    temp_23 = ndb.FloatProperty()  # temperature at 23:00
Or I could store the data
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    time = ndb.TimeProperty()
    temp = ndb.FloatProperty()
(it might be better to store date and time as one property?)
I want to be able to query the datastore to calculate the total, max, min, and average temperature for any given date range. In the first option I could potentially create 4 more properties to effectively pre-process and store the total, max, etc. for each day, so if I wanted to query the total temperature for a year I would only have to sum 365 values as opposed to 8760. I'm not sure how I would do this in the second option.
I am relatively new to app engine and datastore and I think I am still thinking in terms of relationship db's so any help would really be appreciated. Later on it might be necessary to store data in different time zones.
Thanks
Paul
Personally, I'd go with a variant of the first approach:
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    temp = ndb.FloatProperty(repeated=True)
using the temp list to store temperatures by hour in order as you learn about them. I don't think the preprocessing per-date will add anything much: to compute whatever for a year, you'd still need to fetch 365 entities, and the delay for that will swamp the tiny amount of time required to sum up a few thousand numbers anyway.
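As a hedged illustration of how the aggregates could then be computed (the function name and query shape are assumptions; a composite index on region plus date is implied):

def summarize(region, start_date, end_date):
    # Fetch the daily entities for the region in [start_date, end_date].
    qry = TempData.query(TempData.region == region,
                         TempData.date >= start_date,
                         TempData.date <= end_date)
    readings = []
    for day in qry:
        readings.extend(day.temp)          # up to 24 hourly floats per entity
    if not readings:
        return None
    return {'total': sum(readings),
            'max': max(readings),
            'min': min(readings),
            'avg': sum(readings) / len(readings)}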
In general, preprocessing is useful if you want to handily query by the new fields you create by such processing (e.g. rapidly answer the question "which dates in locale X had average temperatures greater than 20 Celsius"). That does not seem to be your use case.
If anything, if it's common for you to have to compute many-month values, preprocessing to aggregate things per-month (into simpler TempDataMonth entities) may be more useful. Or, any other several-days period you find useful, of course (weeks, ten-day-groups, whatever). Those could be computed in a background task periodically checking which such periods have become complete since the last check. But, this is a bit beyond your question, so I'm not getting into fine-grained details.
The general idea is that minimizing the number of entities to fetch tends to be the single most important optimization; other optimizations are of course also possible, but they tend to play second fiddle to that :-).
Here is the situation
I have a model like
class Content(ndb.Model):
    likeCount = ndb.IntegerProperty(default=0)
    likeUser = ndb.KeyProperty(kind=User, repeated=True)
When new content is generated, a new "Content" object is created like this:
content_obj_key = Content(parent=objContentData.key,  # where ContentData is another ndb.Model subclass
                          likeUser=[],
                          likeCount=0
                          ).put()
And when any user likes that content, the function below gets called:
def modify_like(contentData_key, user_key):
    like_obj = Content.query(ancestor=contentData_key).get()
    if like_obj:
        like_obj.likeUser.append(user_key)
        like_obj.likeCount += 1
        like_obj.put()
Problem:
Now the problem is that when more than 4 users like the same content at the same time, this object gets written with wrong data.
Let's say userA, userB, userC and userD like this content at the same time, and currently only userE has liked it.
After all four new writes, the "likeCount" is not 5 but always less than 5, and the "likeUser" list length is also less than 5.
How can I solve this problem so that the data always remains consistent?
It may be that some of the updates are stepping on each other, since several users may be incrementing the same count value at the same time.
If userA and userB get the Content object at the same time, both read the same count value (likeCount=1). Then both increment it to 2, when the total should be 3.
One possible solution is to use sharding. This is useful when entities in your application may have a lot of writes. The count is the total of all shards for that entity. Example code from the documentation:
import random

from google.appengine.ext import ndb

NUM_SHARDS = 5

class SimpleCounterShard(ndb.Model):
    """Shards for the counter"""
    count = ndb.IntegerProperty(default=0)

def get_count():
    """Retrieve the value for a given sharded counter.

    Returns:
        Integer; the cumulative count of all sharded counters.
    """
    total = 0
    for counter in SimpleCounterShard.query():
        total += counter.count
    return total

@ndb.transactional
def increment():
    """Increment the value for a given sharded counter."""
    shard_string_index = str(random.randint(0, NUM_SHARDS - 1))
    counter = SimpleCounterShard.get_by_id(shard_string_index)
    if counter is None:
        counter = SimpleCounterShard(id=shard_string_index)
    counter.count += 1
    counter.put()
More info and examples on sharding counters can be found at:
https://cloud.google.com/appengine/articles/sharding_counters
Use transactions:
@ndb.transactional(retries=10)
def modify_like(contentData_key, user_key):
    ...
A transaction is an operation or set of operations that are guaranteed to be atomic, which means that transactions are never partially applied. Either all of the operations in the transaction are applied, or none of them are applied. Transactions have a maximum duration of 60 seconds with a 10 second idle expiration time after 30 seconds.
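A minimal sketch of the like handler from the question wrapped in a transaction (the retries value is just an example):

@ndb.transactional(retries=10)
def modify_like(contentData_key, user_key):
    # The ancestor query keeps the read-modify-write inside one entity group,
    # so concurrent calls are serialized and retried instead of overwriting each other.
    like_obj = Content.query(ancestor=contentData_key).get()
    if like_obj:
        like_obj.likeUser.append(user_key)
        like_obj.likeCount += 1
        like_obj.put()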
My GAE app will request weekly data from Google Analytics like
number of visitors during last week
number of visitors of particular page during last week
etc.
Then I would like to show this data on my GAE web-page with Google Charts. The data will be shown for last X weeks (let's say, 10 weeks).
What is the best approach to store this data (number of metrics multiplied by number of weeks)? Old data could be deleted.
I don't think I should use datastore like:
class Visitors(ndb.Model):
    week1 = ndb.IntegerProperty(default=0)  # should store week start and end dates also
    week2 = ndb.IntegerProperty(default=0)
    ...
Probably, it would be better to store data like:
class Analytics(ndb.Model):
    visitors = ndb.StringProperty(default='0')  # comma-separated values like '1000,1001,1002'; last value is previous week
    page_visitors = ndb.IntegerProperty(repeated=True)  # [1000, 1001, 1002]
    ...
What are you trying to optimize?
With this amount of data, you will pay pennies, or less, for data storage. You are well within the free quota on datastore reads and writes. Performance-wise, the difference is negligible.
I would recommend going with the most straightforward solution: each week is a new entity, each data point is in its own property.
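A minimal sketch of what that could look like (the entity and property names are assumptions):

class WeeklyStats(ndb.Model):
    # One entity per week; each metric is its own property.
    week_start = ndb.DateProperty(required=True)
    week_end = ndb.DateProperty()
    visitors = ndb.IntegerProperty(default=0)
    page_visitors = ndb.IntegerProperty(default=0)

# Chart the last 10 weeks:
recent = WeeklyStats.query().order(-WeeklyStats.week_start).fetch(10)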
So I am currently performing a test, to estimate how much can my Google app engine work, without going over quotas.
This is my test:
I have in the datastore an entity that, according to my local dashboard, needs 18 write operations. I have 5 entries of this type in a table.
Every 30 seconds, I fetch those 5 entities mentioned above. I DO NOT USE MEMCACHE FOR THESE!
That means 5 * 18 = 90 read operations per fetch, right?
In 1 minute that means 180, and in 1 hour that means 10800 read operations, which is ~20% of the daily limit quota.
However, after 1 hour of my test running, I noticed on my online dashboard that only 2% of the read operations were used. My question is: why is that? Where is the flaw in my calculations?
Also, where can I see in the online dashboard how many read/write operations an entity needs?
Thanks
A write on your entity may need 18 writes, but a get on your entity will cost you only 1 read.
So if you get 5 entries every 30 seconds during one hour, you'll have about 5 reads * 120 = 600 reads.
That is the case if you do a get on your 5 entries (fetching each entry by its id).
If you run a query to fetch them instead, the cost is "1 read + 1 read per entity retrieved", which means 2 reads per entity, so around 1200 reads in one hour.
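As an illustrative sketch of the two access patterns (the model name is an assumption), with the per-operation costs taken from the estimates above:

from google.appengine.ext import ndb

class MyEntity(ndb.Model):   # hypothetical stand-in for the entity in the question
    value = ndb.IntegerProperty()

# Get by key: 1 read per entity returned, regardless of how many properties it has.
keys = [ndb.Key(MyEntity, i) for i in range(1, 6)]
entities = ndb.get_multi(keys)

# Query: 1 read for the query itself + 1 read per entity retrieved.
entities = MyEntity.query().fetch(5)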
For more detailed information, here is the documentation for estimating costs.
You can't see on the dashboard how many read/write operations an entity needs, but I invite you to check Appstats for that.