parameters for mapreduce - google-app-engine

I am using Java mapreduce module of appengine
I get the following info message
Out of mapper quota. Aborting request until quota is replenished. Consider increasing mapreduce.mapper.inputprocessingrate (default 1000) if you would like your mapper job to complete faster.
Task parameters.
queue name = default
rate = 1/s
bucketsize = 1
I have about 2000 entities of the KIND, and I am just doing the logging in the map() call
What mapreduce/task parameters needs to be provided to get rid of that info message.
-Aswath

I believe that this is a special quota in mapreduce implemented by the framework itself. It's designed to limit the speed that it can consume resources, so that
mapreduce does not run through available app engine quota too quickly. It looks like it denotes the maximum overall rate of map() calls/second.
Try increasing the mapreduce.mapper.inputprocessingrate property in the configuration for your map job.
Or, just to test, you can change the default, defined in mapreduce/AppEngineJobContext.java:
public static final int DEFAULT_MAP_INPUT_PROCESSING_RATE = 1000;

Related

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have a an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graph of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
for acct in qry.fetch():
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
# model for data to save
...
# Function for querying data below a given ancestor between two optional
# times
#classmethod
def time_ordered_query(cls, ancestor_key, start=None, end=None):
return cls.query(cls.time>=start, cls.time<=end,ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
cursor = None
while True:
gc.collect()
fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
if fetched:
for acct in fetched:
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
if more and next_cursor:
cursor = next_cursor
else:
break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using Projection Queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely querying all properties of the entity excluding the JSON string. While this is not ideal as it still pulls gratuitous information from the database each time, generating individual queries of each specific query is not scalable due to the exponential growth of necessary indices. For this application, as each additional property is negligible additional memory (aside form that json string), it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally you could try to create and store just portions of the graphs and just stitch them together in the end (only if it comes down to that, I'm not sure how exactly it would be done, I imagine it wouldn't be trivial).

FetchOptions withLimit() does not reduce query execution time (Google App Engine)

Problem
Running a datastore query with or without FetchOptions.Builder.withLimit(100) takes the same execution time! Why is that? Isn't the limit method intended to reduce the time to retrieve results!?
Test setup
I am locally testing the execution time of some datastore queries with Google's App Engine. I am using the Google Cloud SDK Standard Environment with the App Engine SDK 1.9.59.
For the test, I created an example entity with 5 indexed properties and 5 unindexed properties. I filled the datastore with 50.000 entries of a test entity. I run the following method to retrieve 100 of this entities by utilizing the withLimit() method.
public List<Long> getTestIds() {
List<Long> ids = new ArrayList<>();
FetchOptions fetchOptions = FetchOptions.Builder.withLimit(100);
Query q = new Query("test_kind").setKeysOnly();
for (Entity entity : datastore.prepare(q).asIterable(fetchOptions)) {
ids.add(entity.getKey().getId());
}
return ids;
}
I measure the time before and after calling this method:
long start = System.currentTimeMillis();
int size = getTestIds().size();
long end = System.currentTimeMillis();
log.info("time: " + (end - start) + " results: " + size);
I log the execution time and the number of returned results.
Results
When I do not use the withLimit() FetchOptions for the query, I get the expected 50.000 results in about 1740 ms. Nothing surprising here.
If I run the code as displayed above and use withLimit(100) I get the expected 100 results. However, the query runs about the same 1740 ms!
I tested with different numbers of datastore entries and different limits. Every time the queries with or without withLimit(100) took the same time.
Question
Why is the query still fetching all entities? I am sure the query is not supposed to get all entities even though the limit is set to 100 right? What am I missing? Is there some datastore configuration for that? After testing and searching the web for 4 days I still can't find the problem.
FWIW, you shouldn't expect meaningful results from datastore performance tests performed locally, using either the development server or the datastore emulator - they're just emulators, they don't have the same performance (or even the 100% equivalent functionality) as the real datastore.
See for example Datastore fetch VS fetch(keys_only=True) then get_multi (including comments)

Understanding Datastore Get RPCs in Google App Engine

I'm using sharded counters (https://cloud.google.com/appengine/articles/sharding_counters) in my GAE application for performance reasons, but I'm having some trouble understanding why it's so slow and how I can speed things up.
Background
I have an API that grabs a set of 20 objects at a time and for each object, it gets a total from a counter to include in the response.
Metrics
With Appstats turned on and a clear cache, I notice that getting the totals for 20 counters makes 120 RPCs by datastore_v3.Get which takes 2500ms.
Thoughts
This seems like quite a lot of RPC calls and quite a bit of time for reading just 20 counters. I assumed this would be faster and maybe that's where I'm wrong. Is it supposed to be faster than this?
Further Inspection
I dug into the stats a bit more, looking at these two lines in the get_count method:
all_keys = GeneralCounterShardConfig.all_keys(name)
for counter in ndb.get_multi(all_keys):
If I comment out the get_multi line, I see that there are 20 RPC calls by datastore_v3.Get totaling 185ms.
As expected, this leaves get_multi to be the culprit for 100 RPC calls by datastore_v3. Get taking upwards of 2500 ms. I verified this, but this is where I'm confused. Why does calling get_multi with 20 keys cause 100 RPC calls?
Update #1
I checked out Traces in the GAE console and saw some additional information. They show a breakdown of the RPC calls there as well - but in the sights they say to "Batch the gets to reduce the number of remote procedure calls." Not sure how to do that outside of using get_multi. Thought that did the job. Any advice here?
Update #2
Here are some screen shots that show the stats I'm looking at. The first one is my base line - the function without any counter operations. The second one is after a call to get_count for just one counter. This shows a difference of 6 datastore_v3.Get RPCs.
Base Line
After Calling get_count On One Counter
Update #3
Based on Patrick's request, I'm adding a screenshot of info from the console Trace tool.
Try splitting up the for loop that goes through each item and the actual get_multi call itself. So something like:
all_values = ndb.get_multi(all_keys)
for counter in all_values:
# Insert amazeballs codes here
I have a feeling it's one of these:
The generator pattern (yield from for loop) is causing something funky with get_multi execution paths
Perhaps the number of items you are expecting doesn't match actual result counts, which could reveal a problem with GeneralCounterShardConfig.all_keys(name)
The number of shards is set too high. I've realized that anything over 10 shards causes performance issues.
When I've dug into similar issues, one thing I've learned is that get_multi can cause multiple RPCs to be sent from your application. It looks like the default in the SDK is set to 1000 keys per get, but the batch size I've observed in production apps is much smaller: something more like 10 (going from memory).
I suspect the reason it does this is that at some batch size, it actually is better to use multiple RPCs: there is more RPC overhead for your app, but there is more Datastore parallelism. In other words: this is still probably the best way to read a lot of datastore objects.
However, if you don't need to read the absolute most current value, you can try setting the db.EVENTUAL_CONSISTENCY option, but that seems to only be available in the older db library and not in ndb. (Although it also appears to be available via the Cloud Datastore API).
Details
If you look at the Python code in the App Engine SDK, specifically the file google/appengine/datastore/datastore_rpc.py, you will see the following lines:
max_count = (Configuration.max_get_keys(config, self.__config) or
self.MAX_GET_KEYS)
...
if is_read_current and txn is None:
max_egs_per_rpc = self.__get_max_entity_groups_per_rpc(config)
else:
max_egs_per_rpc = None
...
pbsgen = self._generate_pb_lists(indexed_keys_by_entity_group,
base_req.ByteSize(), max_count,
max_egs_per_rpc, config)
rpcs = []
for pbs, indexes in pbsgen:
rpcs.append(make_get_call(base_req, pbs,
self.__create_result_index_pairs(indexes)))
My understanding of this:
Set max_count from the configuration object, or 1000 as a default
If the request must read the current value, set max_gcs_per_rpc from the configuration, or 10 as a default
Split the input keys into individual RPCs, using both max_count and max_gcs_per_rpc as limits.
So, this is being done by the Python Datastore library.

Which NDB query function is more efficient to iterate through a big set of query results?

I use NDB for my app and use iter() with limit and starting cursor to iterate through 20,000 query results in a task. A lot of time I run into timeout error.
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The way I make the call is like this:
results = query.iter(limit=20000, start_cursor=cursor, produce_cursors=True)
for item in results:
process(item)
save_cursor_for_next_time(results.cursor_after().urlsafe())
I can reduce the limit but I thought a task can run as long as 10 mins. 10 mins should be more than enough time to go through 20000 results. In fact, on a good run, the task can complete in just about a minute.
If I switched to fetch() or fetch_page(), would they be more efficient and less likely to run into the timeout error? I suspect there's a lot of overhead in iter() that causes the timeout error.
Thanks.
Fetch is not really any more efficient they all use the same mechanism, unless you know how many entities you want upfront - then fetch can be more efficient as you end up with just one round trip.
You can increase the batch size for iter, that can improve things. See https://developers.google.com/appengine/docs/python/ndb/queryclass#kwdargs_options
From the docs the default batch size is 20, which would mean for 20,000 entities a lot of batches.
Other things that can help. Consider using map and or map_async on the processing, rather than explicitly calling process(entity) Have a read https://developers.google.com/appengine/docs/python/ndb/queries#map also introducing async into your processing can mean improved concurrency.
Having said all of that you should profile so you can understand where the time is used. For instance the delays could be in your process due to things you are doing there.
There are other things to conside with ndb like context caching, you need to disable it. But I also used iter method for these. I also made an ndb version of the mapper api with the old db.
Here is my ndb mapper api that should solve timeout problems and ndb caching and easily create this kind of stuff:
http://blog.altlimit.com/2013/05/simple-mapper-class-for-ndb-on-app.html
with this mapper api you can create it like or you can just improve it too.
class NameYourJob(Mapper):
def init(self):
self.KIND = YourItemModel
self.FILTERS = [YourItemModel.send_email == True]
def map(self, item):
# here is your process(item)
# process here
item.send_email = False
self.update(item)
# Then run it like this
from google.appengine.ext import deferred
deferred.defer(NameYourJob().run, 50, # <-- this is your batch
_target='backend_name_if_you_want', _name='a_name_to_avoid_dups')
For potentially long query iterations, we use a time check to ensure slow processing can be handled. Given the disparities in GAE infrastructure performance, you will likely never find an optimal processing number. The code excerpt below is from an on-line maintenance handler we use which generally runs within ten seconds. If not, we get a return code saying it needs to be run again thanks to our timer check. In your case, you would likely break the process after passing the cursor to your next queue task. Here is some sample code which is edited down to hopefully give you a good idea of our logic. One other note: you may choose to break this up into smaller bites and then fan out the smaller tasks by re-enqueueing the task until it completes. Doing 20k things at once seems very aggressive in GAE's highly variable environment. HTH -stevep
def over_dt_limit(start, milliseconds):
dt = datetime.datetime.now() - start
mt = float(dt.seconds * 1000) + (float(dt.microseconds)/float(1000))
if mt > float(milliseconds):
return True
return False
#set a start time
start = datetime.datetime.now()
# handle a timeout issue inside your query iteration
for item in query.iter():
# do your loop logic
if over_dt_limit(start, 9000):
# your specific time-out logic here
break

Is there a count_async() equivalent for db models?

NB: I am using db (not ndb) here. I know ndb has a count_async() but I am hoping for a solution that does not involve migrating over to ndb.
Occasionally I need an accurate count of the number of entities that match a query. With db this is simply:
q = some Query with filters
num_entities = q.count(limit=None)
It costs a small db operation per entity but it gets me the info I need. The problem is that I often need to do a few of these in the same request and it would be nice to do them asynchronously but I don't see support for that in the db library.
I was thinking I could use run(keys_only=True, batch_size=1000) as it runs the query asynchronously and returns an iterator. I could first call run() on each query and then later count the results from each iterator. It costs the same as count() however run() has proven to be slower in testing (perhaps because it actually returns results) and in fact it seems that batch_size is limited at 300 regardless of how high I set it which requires more RPCs to do a count of thousands of entities than the count() method does.
My test code for run() looks like this:
queries = list of Queries with filters
iters = []
for q in queries:
iters.append( q.run(keys_only=True, batch_size=1000) )
for iter in iters:
count_entities_from(iter)
No, there's no equivalent in db. The whole point of ndb is that it adds these sort of capabilities which were missing in db.

Resources