In Google App Engine, how to I reduce memory consumption as I write a file out to the blobstore rather than exceed the soft memory limit? - google-app-engine

I'm using the blobstore to backup and recovery entities in csv format. The process is working well for all of my smaller models. However, once I start to work on models with more than 2K entities, I am exceeded the soft memory limit. I'm only fetching 50 entities at a time and then writing the results out to the blobstore, so I'm not clear why my memory usage would be building up. I can reliably make the method fail just by increasing the "limit" value passed in below which results in the method running just a little longer to export a few more entities.
Any recommendations on how to optimize this process to reduce memory consumption?
Also, the files produced will only <500KB in size. Why would the process use 140 MB of memory?
Simplified example:
file_name = files.blobstore.create(mime_type='application/octet-stream')
with files.open(file_name, 'a') as f:
writer = csv.DictWriter(f, fieldnames=properties)
for entity in models.Player.all():
row = backup.get_dict_for_entity(entity)
writer.writerow(row)
Produces the error:
Exceeded soft private memory limit with 150.957 MB after servicing 7 requests total
Simplified example 2:
The problem seems to be with using files and the with statement in python 2.5. Factoring out the csv stuff, I can reproduce almost the same error by simply trying to write a 4000 line text file to the blobstore.
from __future__ import with_statement
from google.appengine.api import files
from google.appengine.ext.blobstore import blobstore
file_name = files.blobstore.create(mime_type='application/octet-stream')
myBuffer = StringIO.StringIO()
#Put 4000 lines of text in myBuffer
with files.open(file_name, 'a') as f:
for line in myBuffer.getvalue().splitlies():
f.write(line)
files.finalize(file_name)
blob_key = files.blobstore.get_blob_key(file_name)
Produces the error:
Exceeded soft private memory limit with 154.977 MB after servicing 24 requests total
Original:
def backup_model_to_blobstore(model, limit=None, batch_size=None):
file_name = files.blobstore.create(mime_type='application/octet-stream')
# Open the file and write to it
with files.open(file_name, 'a') as f:
#Get the fieldnames for the csv file.
query = model.all().fetch(1)
entity = query[0]
properties = entity.__class__.properties()
#Add ID as a property
properties['ID'] = entity.key().id()
#For debugging rather than try and catch
if True:
writer = csv.DictWriter(f, fieldnames=properties)
#Write out a header row
headers = dict( (n,n) for n in properties )
writer.writerow(headers)
numBatches = int(limit/batch_size)
if numBatches == 0:
numBatches = 1
for x in range(numBatches):
logging.info("************** querying with offset %s and limit %s", x*batch_size, batch_size)
query = model.all().fetch(limit=batch_size, offset=x*batch_size)
for entity in query:
#This just returns a small dictionary with the key-value pairs
row = get_dict_for_entity(entity)
#write out a row for each entity.
writer.writerow(row)
# Finalize the file. Do this before attempting to read it.
files.finalize(file_name)
blob_key = files.blobstore.get_blob_key(file_name)
return blob_key
The error looks like this in the logs
......
2012-02-02 21:59:19.063
************** querying with offset 2050 and limit 50
I 2012-02-02 21:59:20.076
************** querying with offset 2100 and limit 50
I 2012-02-02 21:59:20.781
************** querying with offset 2150 and limit 50
I 2012-02-02 21:59:21.508
Exception for: Chris (202.161.57.167)
err:
Traceback (most recent call last):
.....
blob_key = backup_model_to_blobstore(model, limit=limit, batch_size=batch_size)
File "/base/data/home/apps/singpath/163.356548765202135434/singpath/backup.py", line 125, in backup_model_to_blobstore
writer.writerow(row)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 281, in __exit__
self.close()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 275, in close
self._make_rpc_call_with_retry('Close', request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 388, in _make_rpc_call_with_retry
_make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 236, in _make_call
_raise_app_error(e)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 179, in _raise_app_error
raise FileNotOpenedError()
FileNotOpenedError
C 2012-02-02 21:59:23.009
Exceeded soft private memory limit with 149.426 MB after servicing 14 requests total

You'd be better off not doing the batching yourself, but just iterating over the query. The iterator will pick a batch size (probably 20) that should be adequate:
q = model.all()
for entity in q:
row = get_dict_for_entity(entity)
writer.writerow(row)
This avoids re-running the query with ever-increasing offset, which is slow and causes quadratic behavior in the datastore.
An oft-overlooked fact about memory usage is that the in-memory representation of an entity can use 30-50 times the RAM compared to the serialized form of the entity; e.g. an entity that is 3KB on disk might use 100KB in RAM. (The exact blow-up factor depends on many factors; it's worse if you have lots of properties with long names and small values, even worse for repeated properties with long names.)

In What is the proper way to write to the Google App Engine blobstore as a file in Python 2.5 a similar problem was reported. In an answer there it is suggested that you should try inserting gc.collect() calls occasionally. Given what I know of the files API's implementation I think that is spot on. Give it a try!

I can't speak for the memory use in Python, but considering your error message, the error most likely stems from the fact that a blobstore backed file in GAE can't be open for more than around 30 seconds so you have to close and reopen it periodically if your processing takes longer.

It can possibly be a Time Exceed error, due to the limitation of the request to 30 secs. In my implementation in order to bypass it instead of having a webapp handler for the operation I am firing an event in the default queue. The cool thing about the queue is that it takes one line of code to invoke it, it has a 10 min time limit and if a task fails it retries before the time limit. I am not really sure if it will solve your problem but it worths giving a try.
from google.appengine.api import taskqueue
...
taskqueue.add("the url that invokes your method")
you can find more info about the queues here.
Or consider using a backend for serious computations and file operations.

Related

how to buffer a batch of data in flink

I want to buffer a datastream in flink. My initial idea is caching 100 pieces of data into a list or tuple and then using insert into values (???) to insert data into clickhouse in bulk. Do you have better ways to do this?
The first solution that you post works but it is flaky. It can lead to starvation due to a simplistic logic. For instance, let's say that you have a counter of 100 to create a batch. It is possible that your stream never receives 100 events, or it takes hours to receive the 100th event. Then your basic and working solution can have events stuck in the window batch because it is a count window. In other words, your batch can generate windows of 30 seconds in a high throughput, or windows of 1 hour when your throughput is very low.
DataStream<User> stream = ...;
DataStream<Tuple2<User, Long>> stream1 = stream
.countWindowAll(100)
.process(new MyProcessWindowFunction());
In general, it depends on your use case. However, I would use a time window to make sure that my job always has the flush batch even though there are few or no events on the window.
DataStream<Tuple2<User, Long>> stream1 = stream
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(30)))
.process(new MyProcessWindowFunction());;
Thanks for all the answers. I use a window function to solve this problem.
SingleOutputStreamOperator<ArrayList<User>> stream2 =
stream1.countWindowAll(batchSize).process(new MyProcessWindowFunction());
Then I overwrite the process function in which the batch size of data is buffered in an ArrayList.
If you want to import data into the database in batches, you can use the window(countWindow or timeWindow)to aggregate the data.

Understanding Datastore Get RPCs in Google App Engine

I'm using sharded counters (https://cloud.google.com/appengine/articles/sharding_counters) in my GAE application for performance reasons, but I'm having some trouble understanding why it's so slow and how I can speed things up.
Background
I have an API that grabs a set of 20 objects at a time and for each object, it gets a total from a counter to include in the response.
Metrics
With Appstats turned on and a clear cache, I notice that getting the totals for 20 counters makes 120 RPCs by datastore_v3.Get which takes 2500ms.
Thoughts
This seems like quite a lot of RPC calls and quite a bit of time for reading just 20 counters. I assumed this would be faster and maybe that's where I'm wrong. Is it supposed to be faster than this?
Further Inspection
I dug into the stats a bit more, looking at these two lines in the get_count method:
all_keys = GeneralCounterShardConfig.all_keys(name)
for counter in ndb.get_multi(all_keys):
If I comment out the get_multi line, I see that there are 20 RPC calls by datastore_v3.Get totaling 185ms.
As expected, this leaves get_multi to be the culprit for 100 RPC calls by datastore_v3. Get taking upwards of 2500 ms. I verified this, but this is where I'm confused. Why does calling get_multi with 20 keys cause 100 RPC calls?
Update #1
I checked out Traces in the GAE console and saw some additional information. They show a breakdown of the RPC calls there as well - but in the sights they say to "Batch the gets to reduce the number of remote procedure calls." Not sure how to do that outside of using get_multi. Thought that did the job. Any advice here?
Update #2
Here are some screen shots that show the stats I'm looking at. The first one is my base line - the function without any counter operations. The second one is after a call to get_count for just one counter. This shows a difference of 6 datastore_v3.Get RPCs.
Base Line
After Calling get_count On One Counter
Update #3
Based on Patrick's request, I'm adding a screenshot of info from the console Trace tool.
Try splitting up the for loop that goes through each item and the actual get_multi call itself. So something like:
all_values = ndb.get_multi(all_keys)
for counter in all_values:
# Insert amazeballs codes here
I have a feeling it's one of these:
The generator pattern (yield from for loop) is causing something funky with get_multi execution paths
Perhaps the number of items you are expecting doesn't match actual result counts, which could reveal a problem with GeneralCounterShardConfig.all_keys(name)
The number of shards is set too high. I've realized that anything over 10 shards causes performance issues.
When I've dug into similar issues, one thing I've learned is that get_multi can cause multiple RPCs to be sent from your application. It looks like the default in the SDK is set to 1000 keys per get, but the batch size I've observed in production apps is much smaller: something more like 10 (going from memory).
I suspect the reason it does this is that at some batch size, it actually is better to use multiple RPCs: there is more RPC overhead for your app, but there is more Datastore parallelism. In other words: this is still probably the best way to read a lot of datastore objects.
However, if you don't need to read the absolute most current value, you can try setting the db.EVENTUAL_CONSISTENCY option, but that seems to only be available in the older db library and not in ndb. (Although it also appears to be available via the Cloud Datastore API).
Details
If you look at the Python code in the App Engine SDK, specifically the file google/appengine/datastore/datastore_rpc.py, you will see the following lines:
max_count = (Configuration.max_get_keys(config, self.__config) or
self.MAX_GET_KEYS)
...
if is_read_current and txn is None:
max_egs_per_rpc = self.__get_max_entity_groups_per_rpc(config)
else:
max_egs_per_rpc = None
...
pbsgen = self._generate_pb_lists(indexed_keys_by_entity_group,
base_req.ByteSize(), max_count,
max_egs_per_rpc, config)
rpcs = []
for pbs, indexes in pbsgen:
rpcs.append(make_get_call(base_req, pbs,
self.__create_result_index_pairs(indexes)))
My understanding of this:
Set max_count from the configuration object, or 1000 as a default
If the request must read the current value, set max_gcs_per_rpc from the configuration, or 10 as a default
Split the input keys into individual RPCs, using both max_count and max_gcs_per_rpc as limits.
So, this is being done by the Python Datastore library.

Which NDB query function is more efficient to iterate through a big set of query results?

I use NDB for my app and use iter() with limit and starting cursor to iterate through 20,000 query results in a task. A lot of time I run into timeout error.
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The way I make the call is like this:
results = query.iter(limit=20000, start_cursor=cursor, produce_cursors=True)
for item in results:
process(item)
save_cursor_for_next_time(results.cursor_after().urlsafe())
I can reduce the limit but I thought a task can run as long as 10 mins. 10 mins should be more than enough time to go through 20000 results. In fact, on a good run, the task can complete in just about a minute.
If I switched to fetch() or fetch_page(), would they be more efficient and less likely to run into the timeout error? I suspect there's a lot of overhead in iter() that causes the timeout error.
Thanks.
Fetch is not really any more efficient they all use the same mechanism, unless you know how many entities you want upfront - then fetch can be more efficient as you end up with just one round trip.
You can increase the batch size for iter, that can improve things. See https://developers.google.com/appengine/docs/python/ndb/queryclass#kwdargs_options
From the docs the default batch size is 20, which would mean for 20,000 entities a lot of batches.
Other things that can help. Consider using map and or map_async on the processing, rather than explicitly calling process(entity) Have a read https://developers.google.com/appengine/docs/python/ndb/queries#map also introducing async into your processing can mean improved concurrency.
Having said all of that you should profile so you can understand where the time is used. For instance the delays could be in your process due to things you are doing there.
There are other things to conside with ndb like context caching, you need to disable it. But I also used iter method for these. I also made an ndb version of the mapper api with the old db.
Here is my ndb mapper api that should solve timeout problems and ndb caching and easily create this kind of stuff:
http://blog.altlimit.com/2013/05/simple-mapper-class-for-ndb-on-app.html
with this mapper api you can create it like or you can just improve it too.
class NameYourJob(Mapper):
def init(self):
self.KIND = YourItemModel
self.FILTERS = [YourItemModel.send_email == True]
def map(self, item):
# here is your process(item)
# process here
item.send_email = False
self.update(item)
# Then run it like this
from google.appengine.ext import deferred
deferred.defer(NameYourJob().run, 50, # <-- this is your batch
_target='backend_name_if_you_want', _name='a_name_to_avoid_dups')
For potentially long query iterations, we use a time check to ensure slow processing can be handled. Given the disparities in GAE infrastructure performance, you will likely never find an optimal processing number. The code excerpt below is from an on-line maintenance handler we use which generally runs within ten seconds. If not, we get a return code saying it needs to be run again thanks to our timer check. In your case, you would likely break the process after passing the cursor to your next queue task. Here is some sample code which is edited down to hopefully give you a good idea of our logic. One other note: you may choose to break this up into smaller bites and then fan out the smaller tasks by re-enqueueing the task until it completes. Doing 20k things at once seems very aggressive in GAE's highly variable environment. HTH -stevep
def over_dt_limit(start, milliseconds):
dt = datetime.datetime.now() - start
mt = float(dt.seconds * 1000) + (float(dt.microseconds)/float(1000))
if mt > float(milliseconds):
return True
return False
#set a start time
start = datetime.datetime.now()
# handle a timeout issue inside your query iteration
for item in query.iter():
# do your loop logic
if over_dt_limit(start, 9000):
# your specific time-out logic here
break

Google App Engine timeout: The datastore operation timed out, or the data was temporarily unavailable

This is a common exception I'm getting on my application's log daily, usually 5/6 times a day with a traffic of 1K visits/day:
db error trying to store stats
Traceback (most recent call last):
File "/base/data/home/apps/stackprinter/1b.347728306076327132/app/utility/worker.py", line 36, in deferred_store_print_statistics
dbcounter.increment()
File "/base/data/home/apps/stackprinter/1b.347728306076327132/app/db/counter.py", line 28, in increment
db.run_in_transaction(txn)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore.py", line 1981, in RunInTransaction
DEFAULT_TRANSACTION_RETRIES, function, *args, **kwargs)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore.py", line 2067, in RunInTransactionCustomRetries
ok, result = _DoOneTry(new_connection, function, args, kwargs)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore.py", line 2105, in _DoOneTry
if new_connection.commit():
File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1585, in commit
return rpc.get_result()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 530, in get_result
return self.__get_result_hook(self)
File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1613, in __commit_hook
raise _ToDatastoreError(err)
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The function that is raising the exception above is the following one:
def store_printed_question(question_id, service, title):
def _store_TX():
entity = Question.get_by_key_name(key_names = '%s_%s' % \
(question_id, service ) )
if entity:
entity.counter = entity.counter + 1
entity.put()
else:
Question(key_name = '%s_%s' % (question_id, service ),\
question_id ,\
service,\
title,\
counter = 1).put()
db.run_in_transaction(_store_TX)
Basically, the store_printed_question function check if a given question was previously printed, incrementing in that case the related counter in a single transaction.
This function is added by a WebHandler to a deferred worker using the predefined default queue that, as you might know, has a throughput rate of five task invocations per second.
On a entity with six attributes (two indexes) I thought that using transactions regulated by a deferred task rate limit would allow me to avoid datastore timeouts but, looking at the log, this error is still showing up daily.
This counter I'm storing is not so much important, so I'm not worried about getting these timeouts; anyway I'm curious why Google App Engine can't handle this task properly even at a low rate like 5 tasks per second and if lowering the rate could be a possible solution.
A sharded counter on each question to avoid timeouts would be an overkill to me.
EDIT:
I have set the rate limit to 1 task per second on the default queue; I'm still getting the same error.
A query can only live for 30 seconds. See my answer to this question for some sample code to break a query up using cursors.
Generally speaking, a timeout like this is usually because of write contention. If you've got a transaction going and you're writing a bunch of stuff to the same entity group concurrently, you run into write contention issues (a side effect of optimistic concurrency). In most cases, if you make your entity group smaller, that will usually minimize this problem.
In your specific case, based on the code above, it's most probably because you should be using a sharded counter to avoid stacking of serialized writes.
Another far less likely possibility (mentioned here only for completeness) is that the tablet your data is on is being moved.

Query response size limit on appengine?

Appengine docs mention a 1Mb limit on both entity size and batch get requests (db.get()):
http://code.google.com/appengine/docs/python/datastore/overview.html
Is there also a limit on the total size of all entities returned by a query for a single fetch() call?
Example query:
db.Model.all().fetch(1000)
Update: As of 1.4.0 batch get limits have been removed!
Size and quantity limits on datastore batch get/put/delete operations have
been removed. Individual entities are still limited to 1 MB, but your app
may
batch as many entities together for get/put/delete calls as the overall
datastore deadline will allow for.
Theres no longer a limit on the number of entities that can be returned by a query, but the same entity size limit applies when you are actually retrieving / iterating over the entities. This will only be on a single entity at a time though; it is not a limit on the total size of all entities returned by the query.
Bottom line: as long as you don't have a single entity that is > 1Mb you should be OK with queries.
I tried it out on production and you can indeed exceed 1 Mb total for a query. I stopped testing at around 20 Mb total response size.
from app import models
# generate 1Mb string
a = 'a'
while len(a) < 1000000:
a += 'a'
# text is a db.TextProperty()
c = models.Comment(text=a)
c.put()
for c in models.Comment.all().fetch(100):
print c
Output:
<app.models.Comment object at 0xa98f8a68a482e9f8>
<app.models.Comment object at 0xa98f8a68a482e9b8>
<app.models.Comment object at 0xa98f8a68a482ea78>
<app.models.Comment object at 0xa98f8a68a482ea38>
....
Yes there is a size limit; the quotas and limits section explicitly states there is a 1 megabyte limit to db API calls.
You will not be able to db.get(list_of_keys) if the total size of the entities in the batch is over 1 megabyte. Likewise, you will not be able to put a batch if the total size of the entities in the batch is over 1 megabyte.
The 1,000 entity limit has been removed, but (at present) you will need to ensure the total size of your batches is less than 1 megabyte yourself.

Resources