In the video/PDF from "Data pipelines with Google App Engine", Brett puts "now / 30" into the task name, noting that he will explain the reason later, but somehow he never does. :)
http://www.youtube.com/watch?v=zSDC_TU7rtc#t=41m35
task_name = '%s-%d-%d' % (sum_name, int(now / 30), index)
Do you have any idea about the reason? Does it have anything to do with the 7 day period in which one can't re-use task names?
Link to the session page
Brett Slatkin's own explanation
[Brett]
Hey all,
The int(time.time()/30) part of the task name is to prevent queue stalls. When memcache gets evicted the work index counter will be reset to zero. That means new fork-join work items may insert tasks that are named the same as tasks that were already inserted. By including a time window of ~30 seconds in the task name, we ensure that this problem can only last for about thirty seconds. This is also why you should raise an exception when you see a TombstonedTaskError exception.
Worst-case scenario if the clocks are wonky is that two tasks are run to do the fan-in work instead of just one, which is an acceptable trade-off in many cases and a fundamental possibility when using the task queue API. This can be mitigated using pigeon-hole acknowledgment entities, like I use in my materialized view example.
Hope that helps,
[/Brett]
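For concreteness, here is a hedged sketch of what the insertion side of that naming scheme might look like; the handler URL, queue name, and helper name are illustrative, not from the talk.

import time
from google.appengine.api import taskqueue

def insert_fan_in_task(sum_name, index):
    # The ~30 second window bounds how long a stale work index (after a
    # memcache eviction resets the counter) can collide with an already-used
    # task name.
    task_name = '%s-%d-%d' % (sum_name, int(time.time() / 30), index)
    try:
        taskqueue.Task(
            name=task_name,
            url='/work/fan_in',  # illustrative handler path
            params={'sum_name': sum_name, 'index': index},
        ).add(queue_name='fan-in-queue')  # illustrative queue name
    except taskqueue.TaskAlreadyExistsError:
        # Another request already inserted this fan-in task; nothing to do.
        pass
    except taskqueue.TombstonedTaskError:
        # Per Brett: surface this so the caller retries in a later time window.
        raise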
Related
I'm using sharded counters (https://cloud.google.com/appengine/articles/sharding_counters) in my GAE application for performance reasons, but I'm having some trouble understanding why it's so slow and how I can speed things up.
Background
I have an API that grabs a set of 20 objects at a time and for each object, it gets a total from a counter to include in the response.
Metrics
With Appstats turned on and a clear cache, I notice that getting the totals for 20 counters makes 120 RPCs by datastore_v3.Get which takes 2500ms.
Thoughts
This seems like quite a lot of RPC calls and quite a bit of time for reading just 20 counters. I assumed this would be faster and maybe that's where I'm wrong. Is it supposed to be faster than this?
Further Inspection
I dug into the stats a bit more, looking at these two lines in the get_count method:
all_keys = GeneralCounterShardConfig.all_keys(name)
for counter in ndb.get_multi(all_keys):
If I comment out the get_multi line, I see that there are 20 RPC calls by datastore_v3.Get totaling 185ms.
As expected, this leaves get_multi as the culprit for 100 RPC calls by datastore_v3.Get taking upwards of 2500 ms. I verified this, but this is where I'm confused. Why does calling get_multi with 20 keys cause 100 RPC calls?
Update #1
I checked out Traces in the GAE console and saw some additional information. They show a breakdown of the RPC calls there as well - but in the insights they say to "Batch the gets to reduce the number of remote procedure calls." I'm not sure how to do that outside of using get_multi; I thought that did the job. Any advice here?
Update #2
Here are some screen shots that show the stats I'm looking at. The first one is my base line - the function without any counter operations. The second one is after a call to get_count for just one counter. This shows a difference of 6 datastore_v3.Get RPCs.
Base Line
After Calling get_count On One Counter
Update #3
Based on Patrick's request, I'm adding a screenshot of info from the console Trace tool.
Try splitting up the for loop that goes through each item and the actual get_multi call itself. So something like:
all_values = ndb.get_multi(all_keys)
for counter in all_values:
    # Insert amazeballs codes here
I have a feeling it's one of these:
The generator pattern (yield from for loop) is causing something funky with get_multi execution paths
Perhaps the number of items you are expecting doesn't match actual result counts, which could reveal a problem with GeneralCounterShardConfig.all_keys(name)
The number of shards is set too high. I've realized that anything over 10 shards causes performance issues.
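Regarding the "Batch the gets" hint from Update #1, here is a hedged sketch of one way to read that: fetch the shard keys for all 20 counters with a single ndb.get_multi instead of calling get_count() once per counter. The model and helper names follow the sharding_counters article; get_counts is a hypothetical helper, and memcache handling is omitted.

from google.appengine.ext import ndb

def get_counts(names):
    """Return {counter_name: total} using one logical ndb.get_multi."""
    # Collect every shard key for every requested counter.
    keys_by_name = {name: GeneralCounterShardConfig.all_keys(name)
                    for name in names}
    flat_keys = [key for keys in keys_by_name.values() for key in keys]
    # One batched fetch; the library may still split this into several RPCs,
    # but the per-counter call overhead goes away.
    shards = dict(zip(flat_keys, ndb.get_multi(flat_keys)))
    totals = {}
    for name, keys in keys_by_name.items():
        totals[name] = sum(shards[key].count for key in keys
                           if shards[key] is not None)
    return totals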
When I've dug into similar issues, one thing I've learned is that get_multi can cause multiple RPCs to be sent from your application. It looks like the default in the SDK is set to 1000 keys per get, but the batch size I've observed in production apps is much smaller: something more like 10 (going from memory).
I suspect the reason it does this is that at some batch size, it actually is better to use multiple RPCs: there is more RPC overhead for your app, but there is more Datastore parallelism. In other words: this is still probably the best way to read a lot of datastore objects.
However, if you don't need to read the absolute most current value, you can try setting the db.EVENTUAL_CONSISTENCY option, but that seems to only be available in the older db library and not in ndb. (Although it also appears to be available via the Cloud Datastore API).
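If you do go the db route for those reads, here is a hedged sketch of the documented form; shard_keys is a placeholder for a list of db.Key objects pointing at your shard entities, which is an assumption about how you would adapt the sharded-counter code.

from google.appengine.ext import db

# Eventually consistent batch get with a shorter deadline - acceptable when a
# slightly stale counter total is good enough.
entities = db.get(shard_keys, read_policy=db.EVENTUAL_CONSISTENCY, deadline=5)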
Details
If you look at the Python code in the App Engine SDK, specifically the file google/appengine/datastore/datastore_rpc.py, you will see the following lines:
max_count = (Configuration.max_get_keys(config, self.__config) or
             self.MAX_GET_KEYS)
...
if is_read_current and txn is None:
  max_egs_per_rpc = self.__get_max_entity_groups_per_rpc(config)
else:
  max_egs_per_rpc = None
...
pbsgen = self._generate_pb_lists(indexed_keys_by_entity_group,
                                 base_req.ByteSize(), max_count,
                                 max_egs_per_rpc, config)
rpcs = []
for pbs, indexes in pbsgen:
  rpcs.append(make_get_call(base_req, pbs,
                            self.__create_result_index_pairs(indexes)))
My understanding of this:
Set max_count from the configuration object, or 1000 as a default
If the request must read the current value, set max_egs_per_rpc from the configuration, or 10 as a default
Split the input keys into individual RPCs, using both max_count and max_egs_per_rpc as limits.
So, this is being done by the Python Datastore library.
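As a possible knob here: ndb's ContextOptions subclasses datastore_rpc.Configuration, so the datastore-level option below may be accepted as a keyword by get_multi. This is an assumption to verify against your SDK version, not a documented recipe.

from google.appengine.ext import ndb

# Assumption: max_entity_groups_per_rpc passes through ndb's context options.
counters = ndb.get_multi(all_keys, max_entity_groups_per_rpc=20)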
I use NDB for my app and use iter() with a limit and starting cursor to iterate through 20,000 query results in a task. A lot of the time I run into a timeout error.
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The way I make the call is like this:
results = query.iter(limit=20000, start_cursor=cursor, produce_cursors=True)
for item in results:
    process(item)
save_cursor_for_next_time(results.cursor_after().urlsafe())
I can reduce the limit, but I thought a task can run as long as 10 minutes, and 10 minutes should be more than enough time to go through 20,000 results. In fact, on a good run, the task can complete in just about a minute.
If I switched to fetch() or fetch_page(), would they be more efficient and less likely to run into the timeout error? I suspect there's a lot of overhead in iter() that causes the timeout error.
Thanks.
Fetch is not really any more efficient; they all use the same mechanism. The exception is if you know how many entities you want upfront - then fetch can be more efficient, as you end up with just one round trip.
You can increase the batch size for iter; that can improve things. See https://developers.google.com/appengine/docs/python/ndb/queryclass#kwdargs_options
From the docs, the default batch size is 20, which for 20,000 entities means a lot of batches.
Other things that can help: consider using map and/or map_async for the processing, rather than explicitly calling process(entity). Have a read of https://developers.google.com/appengine/docs/python/ndb/queries#map - introducing async into your processing can also mean improved concurrency.
Having said all of that, you should profile so you can understand where the time is spent. For instance, the delays could be in your processing, due to things you are doing there.
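A hedged sketch of both suggestions, assuming an ndb query; 500 is just an example batch size, and process_item stands in for your per-entity work.

from google.appengine.ext import ndb

# Larger batches mean fewer datastore round trips across the 20,000 results.
results = query.iter(limit=20000, start_cursor=cursor,
                     produce_cursors=True, batch_size=500)

# Alternatively, let ndb drive the iteration via map(); the callback is run
# for each entity and ndb can overlap RPCs with processing.
def process_item(item):
    # ... your per-entity work here ...
    pass

query.map(process_item, limit=20000, batch_size=500)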
There are other things to consider with ndb, like context caching, which you need to disable. But I also used the iter method for these. I also made an ndb version of the mapper API that existed for the old db.
Here is my ndb mapper API; it should solve the timeout problems and the ndb caching issue, and makes it easy to create this kind of thing:
http://blog.altlimit.com/2013/05/simple-mapper-class-for-ndb-on-app.html
With this mapper API you can create something like the following (or just improve on it):
class NameYourJob(Mapper):
    def init(self):
        self.KIND = YourItemModel
        self.FILTERS = [YourItemModel.send_email == True]

    def map(self, item):
        # here is your process(item)
        # process here
        item.send_email = False
        self.update(item)

# Then run it like this
from google.appengine.ext import deferred
deferred.defer(NameYourJob().run, 50,  # <-- this is your batch
               _target='backend_name_if_you_want', _name='a_name_to_avoid_dups')
For potentially long query iterations, we use a time check to ensure slow processing can be handled. Given the disparities in GAE infrastructure performance, you will likely never find an optimal processing number. The code excerpt below is from an on-line maintenance handler we use which generally runs within ten seconds. If not, we get a return code saying it needs to be run again thanks to our timer check. In your case, you would likely break the process after passing the cursor to your next queue task. Here is some sample code which is edited down to hopefully give you a good idea of our logic. One other note: you may choose to break this up into smaller bites and then fan out the smaller tasks by re-enqueueing the task until it completes. Doing 20k things at once seems very aggressive in GAE's highly variable environment. HTH -stevep
def over_dt_limit(start, milliseconds):
    dt = datetime.datetime.now() - start
    mt = float(dt.seconds * 1000) + (float(dt.microseconds) / float(1000))
    if mt > float(milliseconds):
        return True
    return False

# set a start time
start = datetime.datetime.now()

# handle a timeout issue inside your query iteration
for item in query.iter():
    # do your loop logic
    if over_dt_limit(start, 9000):
        # your specific time-out logic here
        break
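Building on that, a hedged sketch of handing the cursor to a follow-up task when the timer trips, assuming an ndb query and the deferred library; YourModel, process, and process_batch are illustrative names, and over_dt_limit is the helper above.

import datetime
from google.appengine.ext import deferred, ndb

def process_batch(cursor_urlsafe=None):
    start = datetime.datetime.now()
    cursor = ndb.Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    it = YourModel.query().iter(start_cursor=cursor, produce_cursors=True)
    for item in it:
        process(item)  # your per-entity work
        if over_dt_limit(start, 9000):
            # Out of time budget: hand the cursor to a fresh task and stop.
            deferred.defer(process_batch, it.cursor_after().urlsafe())
            return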
I am using the simple control.start_map() function of the appengine-mapreduce library to start a mapreduce job. This job successfully completes and shows ~43M mapper-calls on the resulting /mapreduce/detail?mapreduce_id=<my_id> page. However, this page makes no mention of the reduce step or any of the underlying appengine-pipeline processes that I believe are still running. Is there some way to return the pipeline ID that this call makes so I can look at the underlying pipelines to help debug this long-running job? I would like to retrieve enough information to pull up this page: /mapreduce/pipeline/status?root=<guid>
Here is an example of the code I am using to start up my mapreduce job originally:
from third_party.mapreduce import control
mapreduce_id = control.start_map(
    name="Backfill",
    handler_spec="mark_tos_accepted",
    reader_spec=(
        "third_party.mapreduce.input_readers.DatastoreInputReader"),
    mapper_parameters={
        "input_reader": {
            "entity_kind": "ModelX"
        },
    },
    shard_count=64,
    queue_name="backfill-mapreduce-queue",
)
Here is the mapping function:
# This is where we keep our copy of appengine-mapreduce
from third_party.mapreduce import operation as op
def mark_tos_accepted(modelx):
    # Skip users who have already been marked
    if (not modelx
            or modelx.tos_accepted == myglobals.LAST_MATERIAL_CHANGE_TO_TOS):
        return
    modelx.tos_accepted = user_models.LAST_MATERIAL_CHANGE_TO_TOS
    yield op.db.Put(modelx)
Here are the relevant portions of the ModelX:
class BackupModel(db.Model):
    backup_timestamp = db.DateTimeProperty(indexed=True, auto_now=True)

class ModelX(BackupModel):
    tos_accepted = db.IntegerProperty(indexed=False, default=0)
For more context, I am trying to debug a problem I am seeing with writes showing up in our data warehouse.
On 3/23/2013, we launched a MapReduce job (let's call it A) over a db.Model (let's call it ModelX) with ~43M entities. 7 hours later, the job "finished" and the /mapreduce/detail page showed that we had successfully mapped over all of the entities, as shown below.
mapper-calls: 43613334 (1747.47/sec avg.)
On 3/31/2013, we launched another MapReduce job (let's call it B) over ModelX. 12 hours later, the job finished with status Success and the /mapreduce/detail page showed that we had successfully mapped over all of the entities, as shown below.
mapper-calls: 43803632 (964.24/sec avg.)
I know that MR job A wrote to all ModelX entities, since we introduced a new property that none of the entities contained before. The ModelX contains an auto_now property like so.
backup_timestamp = ndb.DateTimeProperty(indexed=True, auto_now=True)
Our data warehousing process runs a query over ModelX to find those entities that changed on a certain day and then downloads those entities and stores them in a separate (AWS) database so that we can run analysis over them. An example of this query is:
db.GqlQuery('select * from ModelX where backup_timestamp >= DATETIME(2013, 4, 10, 0, 0, 0) and backup_timestamp < DATETIME(2013, 4, 11, 0, 0, 0) order by backup_timestamp')
I would expect that our data warehouse would have ~43M entities on each of the days that the MR jobs completed, but it is actually more like ~3M, with each subsequent day showing an increase, as shown in this progression:
3/16/13 230751
3/17/13 193316
3/18/13 344114
3/19/13 437790
3/20/13 443850
3/21/13 640560
3/22/13 612143
3/23/13 547817
3/24/13 2317784 // Why isn't this ~43M ?
3/25/13 3701792 // Why didn't this go down to ~500K again?
3/26/13 4166678
3/27/13 3513732
3/28/13 3652571
This makes me think that the op.db.Put() calls issued by the mapreduce job are still running in some pipeline or queue, causing this trickle effect.
Furthermore, if I query for entities with an old backup_timestamp, I can go back pretty far and still get plenty of entities, but I would expect all of these queries to return 0:
In [4]: ModelX.all().filter('backup_timestamp <', 'DATETIME(2013,2,23,1,1,1)').count()
Out[4]: 1000L
In [5]: ModelX.all().filter('backup_timestamp <', 'DATETIME(2013,1,23,1,1,1)').count()
Out[5]: 1000L
In [6]: ModelX.all().filter('backup_timestamp <', 'DATETIME(2012,1,23,1,1,1)').count()
Out[6]: 1000L
However, there is this strange behavior where the query returns entities that it should not:
In [8]: old = ModelX.all().filter('backup_timestamp <', 'DATETIME(2012,1,1,1,1,1)')
In [9]: paste
for o in old[1:100]:
    print o.backup_timestamp
## -- End pasted text --
2013-03-22 22:56:03.877840
2013-03-22 22:56:18.149020
2013-03-22 22:56:19.288400
2013-03-22 22:56:31.412290
2013-03-22 22:58:37.710790
2013-03-22 22:59:14.144200
2013-03-22 22:59:41.396550
2013-03-22 22:59:46.482890
2013-03-22 22:59:46.703210
2013-03-22 22:59:57.525220
2013-03-22 23:00:03.864200
2013-03-22 23:00:18.040840
2013-03-22 23:00:39.636020
This makes me think that the index is just taking a long time to be updated.
I have also graphed the number of entities that our data warehousing downloads and am noticing some cliff-like drops that makes me think that there is some behind-the-scenes throttling going on somewhere that I cannot see with any of the diagnostic tools exposed on the appengine dashboard. For example, this graph shows a fairly large spike on 3/23, when we started the mapreduce job, but then a dramatic fall shortly thereafter.
This graph shows the count of entities returned by the BackupTimestamp GqlQuery for each 10-minute interval for each day. Note that the purple line shows a huge spike as the MapReduce job spins up, and then a dramatic fall ~1hr later as the throttling kicks in. This graph also shows that there seems to be some time-based throttling going on.
I don't think you'll have any reducer functions there, because all you've done is start a mapper. To do a complete mapreduce, you have to explicitly instantiate a MapReducePipeline and call start on it. As a bonus, that answers your question, as it returns the pipeline ID which you can then use in the status URL.
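A hedged sketch of what that could look like for this job; the reducer_spec value is illustrative (the question only has a mapper), the module paths mirror the question's third_party copy, and argument details vary between appengine-mapreduce versions.

from third_party.mapreduce import mapreduce_pipeline

pipeline = mapreduce_pipeline.MapreducePipeline(
    job_name="Backfill",
    mapper_spec="mark_tos_accepted",
    reducer_spec="some_module.identity_reducer",  # illustrative reducer
    input_reader_spec=(
        "third_party.mapreduce.input_readers.DatastoreInputReader"),
    mapper_params={"entity_kind": "ModelX"},
    shards=64,
)
pipeline.start(queue_name="backfill-mapreduce-queue")
# This is the <guid> for /mapreduce/pipeline/status?root=<guid>
pipeline_id = pipeline.pipeline_id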
Just trying to understand the specific problem. Is it that you are expecting a bigger number of entities in your AWS database? I would suspect that the problem lies with the process that downloads your old ModelX entities into an AWS database, that it's somehow not catching all the updated entities.
Is the AWS-downloading process modifying ModelX in any way? If not, then why would you be surprised at finding entities with an old modified time stamp? modified would only be updated on writes, not on read operations.
Kind of unrelated - with respect to throttling I've usually found a throttled task queue to be the problem, so maybe check how old your tasks in there are or if your app is being throttled due to a large amount of errors incurred somewhere else.
control.start_map doesn't use pipeline and has no shuffle/reduce step. When the mapreduce status page shows it's finished, all mapreduce-related taskqueue tasks should have finished. You can examine your queue or even pause it.
I suspect there are problems related to old indexes for the old Model or to eventual consistency. To debug MR, it is useful to filter your warnings/errors log and search by the mr id. To help with your particular case, it might be useful to see your Map handler.
Edit: See my answer. Problem was in our code. MR works fine, it may have a status reporting problem, but at least the input readers work fine.
I have run an experiment several times now, and I am sure that mapreduce (or DatastoreInputReader) has odd behavior. I suspect this might have something to do with key ranges and splitting them, but that is just my guess.
Anyway, here's the setup we have:
We have an NDB model called "AdGroup"; when creating new entities of this model we use the same id returned from AdWords (it's an integer), but we use it as a string: AdGroup(id=str(adgroupId)).
We have 1,163,871 of these entities in our datastore (that's what the "Datastore Admin" page tells us - I know it's not an entirely accurate number, but we don't create/delete adgroups very often, so we can say for sure that the number is 1.1 million or more).
The mapreduce is started (from another pipeline) like this:
mapreduce is started (from another pipeline) like this:
yield mapreduce_pipeline.MapreducePipeline(
    job_name='AdGroup-process',
    mapper_spec='process.adgroup_mapper',
    reducer_spec='process.adgroup_reducer',
    input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_params={
        'entity_kind': 'model.AdGroup',
        'shard_count': 120,
        'processing_rate': 500,
        'batch_size': 20,
    },
)
So, I've tried to run this mapreduce several times today without changing anything in the code and without making changes to the datastore. Every time I ran it, mapper-calls counter had a different value ranging from 450,000 to 550,000.
Correct me if I'm wrong, but considering that I use the very basic DatastoreInputReader - mapper-calls should be equal to the number of entities. So it should be 1.1 million or more.
Note: the reason why I noticed this issue in the first place is because our marketing guys started complaining that "it's been 4 days after we added new adgroups and they still don't show up in your app!".
Right now, I can think of only one workaround - write all keys of all adgroups into a blobstore file (one per line) and then use BlobstoreLineInputReader. The writing to blob part would have to be written in a way that does not utilize DatastoreInputReader, of course. Should I go with this for now, or can you suggest something better?
Note: I have also tried using DatastoreKeyInputReader with the same code - the results were similar - mapper-calls were between 450,000 and 550,000.
So, finally, my questions. Is it important how you generate ids for your entities? Is it better to use int ids instead of str ids? In general, what can I do to make it easier for mapreduce to find and map all of my entities?
PS: I'm still in the process of experimenting with this, I might add more details later.
After further investigation we have found that the error was actually in our code. So, mapreduce actually works as expected (mapper is called for every single datastore entity).
Our code was calling some Google services functions that were sometimes failing (the wonderfully cryptic ApplicationError messages). Due to these failures, MR tasks were being retried. However, we had set a limit on taskqueue retries. MR did not detect or report this in any way - MR was still showing "success" on the status page for all shards. That is why we thought that everything was fine with our code and that there was something wrong with the input reader.
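One hedged way to make this kind of failure visible on the MR status page is to catch it in the mapper and bump a custom counter instead of letting the task retry silently. call_google_service and the counter names are illustrative; op.counters.Increment is part of appengine-mapreduce.

from mapreduce import operation as op

def adgroup_mapper(adgroup):
    try:
        call_google_service(adgroup)  # illustrative: the flaky external call
    except Exception:
        # Shows up as a custom counter on the mapreduce status page.
        yield op.counters.Increment('adgroup-processing-failures')
        return
    yield op.counters.Increment('adgroups-processed')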
This is so weird...
First of all this query works in the datastore viewer, ie. it returns the correct row.
SELECT * FROM Level where short_id = 'Ec71eN'
But if I run this
Level.all().filter("short_id = ", 'Ec71eN').get()
it returns None, if I run this:
db.GqlQuery("SELECT * FROM Level where short_id = '%s'" % 'Ec71eN').get()
it also returns None. If I run this:
level = Level.get_by_id(189009)
it returns the correct row (189009 is the id for the correct row)
Puzzling? What can be wrong here? I have never seen anything like this before; it has worked correctly for at least a couple of weeks in production... I think I have at least two cases now where it doesn't work, starting today.
UPDATE: This cannot be an eventual consistency problem, since the row was 7 hours old when I tried the above. I had two rows with the same symptoms, strangely both generated by the same users. They were both "fixed" after I did a manual fetch of their ids by uploading special-case code like:
if short_id == CASE_1_SHORT_ID:
    level = Level.get_by_id(CASE_1_ID)
After that the query worked as usual.
Are you using the HRD? Nothing's wrong. You know it's supposed to be eventually consistent right?
Query operations are eventually consistent.
Get-by-id operations are fully consistent.
What you describe is correct datastore behavior. It's a bit odd that the datastore viewer operation returns the correct result, but it might have hit a separate tablet on the datastore operation.
Given that the entity was created 7 hours ago, that is odd - 'eventual consistency' should generally only take seconds to minutes.
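For what it's worth, a hedged sketch of how to sidestep the query entirely for this lookup: if short_id were baked into the key at creation time, the lookup becomes a get, which is fully consistent. This only helps for entities written that way; the 'short:' key prefix is illustrative.

# At creation time: make the short id part of the key.
level = Level(key_name='short:' + short_id, short_id=short_id)
level.put()

# At lookup time: a fully consistent get, no index or query involved.
level = Level.get_by_key_name('short:' + short_id)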
If eventual consistency IS the problem, run the same query method a bunch of times and see if it returns the same result. If it continuously returns the same result with the same method, then it is more than likely not an eventual consistency problem. You should switch to the NDB API for querying data as well - it's 1000 times better and Guido worked on it, so you know it's good. Does NDB show the same inconsistency?