Bulk Download via Google App Engine Backend

I have 1.6 million entities in a Google App Engine app that I would like to download. I tried the built-in bulkloader mechanism but found it terribly slow: I can download only ~30 entities/second via the bulkloader, but ~500 entities/second by querying the datastore from a backend. A backend is necessary to circumvent the 60-second request limit. In addition, datastore queries can only live for up to 30 seconds, so the fetches have to be broken up across multiple queries using query cursors.
The code on the server side fetches 1000 entities and returns a query cursor:
cursor = request.get('cursor')
devices = Pushdev.all()
if (cursor and cursor!=''):
    devices.with_cursor(cursor)
next1000 = devices.fetch(1000)
for d in next1000:
    t = int(time.mktime(d.created.timetuple()))
    response.out.write('%s/%s/%d\n'%(d.name,d.alias,t))
response.out.write(devices.cursor())
On the client side, I have a loop that first invokes the handler with a null cursor and then, on each subsequent call, passes the cursor returned by the previous invocation. It terminates when it gets an empty result.
PROBLEM: I am only able to fetch a fraction (~20%) of the entities using this method. I get a response with empty data even though the full set of entities has not been traversed. Why does this method not fetch everything?

I couldn't find anything to confirm or deny this in the docs, but my guess is that all() has a non-deterministic ordering, so that eventually one of your fetch(1000) calls will hit the "last element" and devices.cursor() will return nothing.
Try this:
devices = Pushdev.all().order('__key__')
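The client loop described in the question can be sketched as follows. This is a minimal sketch, not the asker's actual client: `fetch_page` is a hypothetical callable standing in for the HTTP request to the backend, and `make_fake_backend` is an in-memory stand-in for the handler. The only thing taken from the question is the response format (data lines, each ending in a newline, followed by the cursor as the last line).

```python
def download_all(fetch_page):
    """Call fetch_page(cursor) repeatedly until a page carries no data lines.

    fetch_page is a hypothetical stand-in for the HTTP call to the backend;
    it takes a cursor string (or None) and returns the response body.
    """
    rows = []
    cursor = None
    while True:
        body = fetch_page(cursor)
        # Every data line ends with '\n'; the cursor is the final line.
        *lines, cursor = body.split('\n')
        if not lines:
            break  # empty page: the query has been exhausted
        rows.extend(lines)
    return rows


def make_fake_backend(records, page_size):
    """In-memory stand-in for the server handler, for local experimentation."""
    def fetch_page(cursor):
        start = int(cursor) if cursor else 0
        page = records[start:start + page_size]
        body = ''.join(line + '\n' for line in page)
        return body + str(start + len(page))  # the offset plays the cursor role
    return fetch_page
```

The fake backend walks a list by offset, so its ordering is stable by construction. On the real datastore, a stable ordering is what the suggested order('__key__') is meant to provide.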

Related

How to prevent overwriting of database for requests from different instances (Google App Engine using NDB)

My Google App Engine application (Python 3, standard environment) serves requests from users: if the wanted record is not in the database, it creates it.
Here is the database overwriting problem:
When a user (via browser) sends a request, the running GAE instance may temporarily fail to respond, and a new process is created to respond to the request. As a result, two instances respond to the same request. Both instances query the database at almost the same time, each finds that the wanted record does not exist, and each creates a new record. The result is two duplicate records.
Another scenario is that, for some reason, the user's browser sends two requests less than 0.01 second apart, which are processed by two instances on the server side, so duplicate records are created.
I am wondering how one instance can temporarily lock the database to prevent overwriting by another instance.
I have considered the following schemes but have no idea whether they are efficient.
For Python 2, Google App Engine provides memcache, which can be used to mark the status of a query for the purpose of database locking. For Python 3, it seems that one has to set up a Redis server to rapidly exchange database status among different instances. How efficient is database locking with Redis?
Using Flask's session module. The session module can be used to share data (in most cases, the login status of users) among different requests and thus different instances. I am wondering how fast data can be exchanged between instances this way.
Appended information (1)
I followed the advice to use a transaction, but it did not work.
Below is the code I used to verify the transaction.
The reason for the failure may be that the transaction only works for the CURRENT client. For multiple simultaneous requests, the server side of GAE creates different processes or instances to respond to them, and each process or instance has its own independent client.
@staticmethod
def get_test(test_key_id, unique_user_id, course_key_id, make_new=False):
    client = ndb.Client()
    with client.context():
        from google.cloud import datastore
        from datetime import datetime
        client2 = datastore.Client()
        print("transaction started at: ", datetime.utcnow())
        with client2.transaction():
            print("query started at: ", datetime.utcnow())
            my_test = MyTest.query(MyTest.test_key_id==test_key_id, MyTest.unique_user_id==unique_user_id).get()
            import time
            time.sleep(5)
            if make_new and not my_test:
                print("data to create started at: ", datetime.utcnow())
                my_test = MyTest(test_key_id=test_key_id, unique_user_id=unique_user_id, course_key_id=course_key_id, status="")
                my_test.put()
                print("data to created at: ", datetime.utcnow())
        print("transaction ended at: ", datetime.utcnow())
    return my_test
Appended information (2)
Here is new information about the usage of memcache (Python 3).
I tried the following code to lock the database using memcache, but it still failed to avoid overwriting.
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time
    cache_key_id = test_key_id+"_"+user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True: # Retry loop
        result = client.gets(cache_key_id)
        if result is None or result == "":
            client.cas(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter+=1
        if counter>500:
            print("failed after 500 tries.")
            break
    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
    client.cas(cache_key_id, "")
    memcache.delete(cache_key_id)

If the problem is duplication rather than overwriting, you could specify the entity id yourself when creating new entries instead of letting GAE generate a random one. The application will then write to the same entry twice instead of creating two entries. The id can be anything that uniquely identifies the record, such as a session id, a timestamp, etc.
The problem with a transaction is that it prevents you from modifying the same entry in parallel, but it does not stop you from creating two new entries in parallel.
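The idea of a caller-chosen id can be sketched as follows. This is a minimal sketch under stated assumptions, not datastore code: the dict `store` is a hypothetical in-memory stand-in for the datastore, `make_test_id` and `get_or_create_test` are illustrative names, and the lock models an atomic check-and-insert on a single id.

```python
import threading

# In-memory stand-in for the datastore: entity id -> entity (a plain dict).
store = {}
store_lock = threading.Lock()  # makes each check-and-insert below atomic


def make_test_id(test_key_id, unique_user_id):
    """Build a deterministic id from the fields that must be unique together."""
    return '%s_%s' % (test_key_id, unique_user_id)


def get_or_create_test(test_key_id, unique_user_id, course_key_id):
    """Return the entity for this (test, user) pair, creating it at most once."""
    entity_id = make_test_id(test_key_id, unique_user_id)
    with store_lock:
        if entity_id not in store:
            store[entity_id] = {
                'test_key_id': test_key_id,
                'unique_user_id': unique_user_id,
                'course_key_id': course_key_id,
                'status': '',
            }
        return store[entity_id]


def simulate_duplicate_requests(n):
    """Fire n simultaneous 'requests' for the same test and user."""
    threads = [
        threading.Thread(target=get_or_create_test, args=(42, 'user-1', 7))
        for _ in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(store)
```

Because every request for the same test and user maps to the same id, repeated requests land on one entry. In NDB, `get_or_insert` is documented as performing this kind of check-and-create transactionally for a given id; it is worth verifying that against the library version in use.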

I used memcache in the following way (using get/set) and succeeded in locking the database writes.
It seems that gets/cas does not work well. In a test, I set the value with cas(), but a later gets() failed to read it.
Memcache API: https://cloud.google.com/appengine/docs/standard/python3/reference/services/bundled/google/appengine/api/memcache
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time
    cache_key_id = test_key_id+"_"+user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True: # Retry loop
        result = client.get(cache_key_id)
        if result is None or result == "":
            client.set(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter+=1
        if counter>500:
            return "failed after 500 tries of memcache checking."
    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
    client.delete(cache_key_id)
    ...
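Separate get and set calls leave a window in which two instances can both see an empty key and both set it. A minimal sketch of a lock built on an add-if-absent operation instead is shown here. `FakeMemcache` is a hypothetical in-process stand-in written for this example, modelled on the `add` and `delete` calls of the memcache client, where `add` stores a value only if the key is not already present and reports whether it did. `acquire_lock` and `release_lock` are illustrative names.

```python
import threading
import time


class FakeMemcache:
    """Hypothetical in-process stand-in exposing add() and delete()."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()

    def add(self, key, value):
        # Store value only if key is absent; return True when it was stored.
        with self._guard:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._guard:
            self._data.pop(key, None)


def acquire_lock(client, key, max_tries=500, delay=0.01):
    """Try to take the lock; return True on success, False after max_tries."""
    for _ in range(max_tries):
        if client.add(key, 'LOCKED'):
            return True
        time.sleep(delay)
    return False


def release_lock(client, key):
    """Drop the lock so the next caller's add() can succeed."""
    client.delete(key)
```

With this shape the check and the write are one operation, so only one caller can win. In a real deployment the lock value would probably also need an expiry, so that a crashed holder does not block everyone else, and memcache values can be evicted, so this is a best-effort lock rather than a guarantee.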

Transactions:
https://developers.google.com/appengine/docs/python/datastore/transactions
When two or more transactions simultaneously attempt to modify entities in one or more common entity groups, only the first transaction to commit its changes can succeed; all the others will fail on commit.
You should be updating your values inside a transaction. App Engine's transactions will prevent two updates from overwriting each other as long as your read and write are within a single transaction. Be sure to pay attention to the discussion about entity groups.
You have two options:
Implement your own logic for transaction failures (how many times to retry, etc.)
Instead of writing to the datastore directly, create a task to modify an entity. Run a transaction inside a task. If it fails, the App Engine will retry this task until it succeeds.
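The first option (your own retry logic) could look roughly like this. It is a minimal sketch: `TransactionFailed` is a hypothetical exception standing in for whatever error the datastore library raises on a commit conflict, and `txn_func` is any callable that runs the read and write inside a single transaction.

```python
import random
import time


class TransactionFailed(Exception):
    """Hypothetical exception standing in for a commit-time conflict."""


def run_with_retries(txn_func, max_attempts=5, base_delay=0.05):
    """Call txn_func until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_func()
        except TransactionFailed:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Exponential backoff with jitter to spread out competing writers.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())
```

The function passed in must be safe to run more than once, since a failed attempt is simply repeated from the start.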

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
    wDataCsv += '\\n' + ','.join(d)
# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
    # model for data to save
    ...
    # Function for querying data below a given ancestor between two optional
    # times
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        return cls.query(cls.time>=start, cls.time<=end,ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
cursor = None
while True:
    gc.collect()
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
            wDataCsv += '\\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using Projection Queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely querying all properties of the entity excluding the JSON string. While this is not ideal, as it still pulls unneeded information from the database each time, generating an individual projection for each specific query is not scalable due to the exponential growth of the necessary indices. For this application, as each additional property costs negligible additional memory (aside from that JSON string), it works!

You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally you could try to create and store just portions of the graphs and just stitch them together in the end (only if it comes down to that, I'm not sure how exactly it would be done, I imagine it wouldn't be trivial).
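The chunked approach from the list above, combined with keeping only the fields needed for the graph, can be sketched like this. It is a minimal sketch under stated assumptions: `fetch_page` is a hypothetical callable returning `(results, next_cursor, more)`, the same shape as the `fetch_page` call in the question, records are plain dicts rather than NDB entities, and `make_fake_query` is an in-memory stand-in for a real query. In a real app, a projection query is what limits what gets loaded from the datastore; this loop only shows how to avoid holding on to the large records.

```python
def collect_csv(fetch_page, fields, page_size=500):
    """Page through results, keeping only the named fields of each record."""
    lines = ['Time,' + ','.join(fields)]
    cursor = None
    while True:
        records, cursor, more = fetch_page(page_size, cursor)
        for rec in records:
            # Copy out just the small values; the large record is not retained.
            lines.append(','.join([rec['time']] + [str(rec[f]) for f in fields]))
        if not (more and cursor):
            break
    # Join once at the end instead of concatenating a string on every row.
    return '\n'.join(lines)


def make_fake_query(records):
    """Hypothetical in-memory query returning (results, next_cursor, more)."""
    def fetch_page(page_size, cursor):
        start = cursor or 0
        page = records[start:start + page_size]
        end = start + len(page)
        return page, end, end < len(records)
    return fetch_page
```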

FetchOptions withLimit() does not reduce query execution time (Google App Engine)

Problem
Running a datastore query with or without FetchOptions.Builder.withLimit(100) takes the same execution time! Why is that? Isn't the limit method intended to reduce the time to retrieve results!?
Test setup
I am locally testing the execution time of some datastore queries with Google's App Engine. I am using the Google Cloud SDK Standard Environment with the App Engine SDK 1.9.59.
For the test, I created an example entity with 5 indexed properties and 5 unindexed properties. I filled the datastore with 50.000 entries of a test entity. I run the following method to retrieve 100 of these entities by utilizing the withLimit() method.
public List<Long> getTestIds() {
    List<Long> ids = new ArrayList<>();
    FetchOptions fetchOptions = FetchOptions.Builder.withLimit(100);
    Query q = new Query("test_kind").setKeysOnly();
    for (Entity entity : datastore.prepare(q).asIterable(fetchOptions)) {
        ids.add(entity.getKey().getId());
    }
    return ids;
}
I measure the time before and after calling this method:
long start = System.currentTimeMillis();
int size = getTestIds().size();
long end = System.currentTimeMillis();
log.info("time: " + (end - start) + " results: " + size);
I log the execution time and the number of returned results.
Results
When I do not use the withLimit() FetchOptions for the query, I get the expected 50.000 results in about 1740 ms. Nothing surprising here.
If I run the code as displayed above and use withLimit(100) I get the expected 100 results. However, the query runs about the same 1740 ms!
I tested with different numbers of datastore entries and different limits. Every time the queries with or without withLimit(100) took the same time.
Question
Why is the query still fetching all entities? I am sure the query is not supposed to get all entities even though the limit is set to 100 right? What am I missing? Is there some datastore configuration for that? After testing and searching the web for 4 days I still can't find the problem.

FWIW, you shouldn't expect meaningful results from datastore performance tests performed locally, using either the development server or the datastore emulator - they're just emulators, they don't have the same performance (or even the 100% equivalent functionality) as the real datastore.
See for example Datastore fetch VS fetch(keys_only=True) then get_multi (including comments)

Understanding Datastore Get RPCs in Google App Engine

I'm using sharded counters (https://cloud.google.com/appengine/articles/sharding_counters) in my GAE application for performance reasons, but I'm having some trouble understanding why it's so slow and how I can speed things up.
Background
I have an API that grabs a set of 20 objects at a time and for each object, it gets a total from a counter to include in the response.
Metrics
With Appstats turned on and a clear cache, I notice that getting the totals for 20 counters makes 120 RPCs by datastore_v3.Get which takes 2500ms.
Thoughts
This seems like quite a lot of RPC calls and quite a bit of time for reading just 20 counters. I assumed this would be faster and maybe that's where I'm wrong. Is it supposed to be faster than this?
Further Inspection
I dug into the stats a bit more, looking at these two lines in the get_count method:
all_keys = GeneralCounterShardConfig.all_keys(name)
for counter in ndb.get_multi(all_keys):
If I comment out the get_multi line, I see that there are 20 RPC calls by datastore_v3.Get totaling 185ms.
As expected, this leaves get_multi as the culprit for 100 RPC calls by datastore_v3.Get taking upwards of 2500 ms. I verified this, but this is where I'm confused. Why does calling get_multi with 20 keys cause 100 RPC calls?
Update #1
I checked out Traces in the GAE console and saw some additional information. They show a breakdown of the RPC calls there as well, but in the insights they say to "Batch the gets to reduce the number of remote procedure calls." I'm not sure how to do that other than by using get_multi, which I thought did the job. Any advice here?
Update #2
Here are some screen shots that show the stats I'm looking at. The first one is my base line - the function without any counter operations. The second one is after a call to get_count for just one counter. This shows a difference of 6 datastore_v3.Get RPCs.
[Screenshot: Base Line]
[Screenshot: After Calling get_count On One Counter]
Update #3
Based on Patrick's request, I'm adding a screenshot of info from the console Trace tool.

Try splitting up the for loop that goes through each item and the actual get_multi call itself. So something like:
all_values = ndb.get_multi(all_keys)
for counter in all_values:
    # Insert amazeballs codes here
I have a feeling it's one of these:
The generator pattern (yield from for loop) is causing something funky with get_multi execution paths
Perhaps the number of items you are expecting doesn't match actual result counts, which could reveal a problem with GeneralCounterShardConfig.all_keys(name)
The number of shards is set too high. I've realized that anything over 10 shards causes performance issues.

When I've dug into similar issues, one thing I've learned is that get_multi can cause multiple RPCs to be sent from your application. It looks like the default in the SDK is set to 1000 keys per get, but the batch size I've observed in production apps is much smaller: something more like 10 (going from memory).
I suspect the reason it does this is that at some batch size, it actually is better to use multiple RPCs: there is more RPC overhead for your app, but there is more Datastore parallelism. In other words: this is still probably the best way to read a lot of datastore objects.
However, if you don't need to read the absolute most current value, you can try setting the db.EVENTUAL_CONSISTENCY option, but that seems to only be available in the older db library and not in ndb. (Although it also appears to be available via the Cloud Datastore API).
Details
If you look at the Python code in the App Engine SDK, specifically the file google/appengine/datastore/datastore_rpc.py, you will see the following lines:
max_count = (Configuration.max_get_keys(config, self.__config) or
             self.MAX_GET_KEYS)
...
if is_read_current and txn is None:
    max_egs_per_rpc = self.__get_max_entity_groups_per_rpc(config)
else:
    max_egs_per_rpc = None
...
pbsgen = self._generate_pb_lists(indexed_keys_by_entity_group,
                                 base_req.ByteSize(), max_count,
                                 max_egs_per_rpc, config)
rpcs = []
for pbs, indexes in pbsgen:
    rpcs.append(make_get_call(base_req, pbs,
                              self.__create_result_index_pairs(indexes)))
My understanding of this:
Set max_count from the configuration object, or 1000 as a default
If the request must read the current value, set max_egs_per_rpc from the configuration, or 10 as a default
Split the input keys into individual RPCs, using both max_count and max_egs_per_rpc as limits.
So, this is being done by the Python Datastore library.
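To illustrate the splitting described above, here is a simplified sketch. It is not the SDK's `_generate_pb_lists`: `split_into_rpcs` is a hypothetical helper, it ignores the request byte-size limit, and the defaults of 1000 keys and 10 entity groups are taken from the reading of the code given above.

```python
def split_into_rpcs(keys_by_entity_group, max_count=1000, max_egs_per_rpc=10):
    """Group keys into batches bounded by key count and entity-group count.

    keys_by_entity_group maps an entity-group id to the list of keys in it.
    Pass max_egs_per_rpc=None to drop the entity-group limit.
    """
    batches = []
    current, groups = [], set()
    for group, keys in keys_by_entity_group.items():
        for key in keys:
            full_keys = len(current) >= max_count
            full_groups = (max_egs_per_rpc is not None
                           and group not in groups
                           and len(groups) >= max_egs_per_rpc)
            if full_keys or full_groups:
                # Either limit reached: close this batch and start a new one.
                batches.append(current)
                current, groups = [], set()
            current.append(key)
            groups.add(group)
    if current:
        batches.append(current)
    return batches
```

Under these assumptions, 20 keys that each live in their own entity group are split into two batches of 10 when the entity-group limit applies, and go out as a single batch when it does not.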

App Engine query in admin datastore viewer returning different results than programmatic query

I'm flummoxed.
I noticed today that some data I thought should be present in my production appengine app wasn't showing up. I connected to the app via the remote console and ran the queries manually. Sure enough it looked like I only had 15 of the 101 rows I was expecting to see.
Then I went to my admin console at appengine.google.com and fired up the datastore viewer with the following query:
SELECT * FROM Assignment where game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')
The result I see is the first page of 20 results. I page through those results, and am able to see all 101 entities. HOORAY! My data is still there. BUT why then can't I access it via the db api? (NOTE: I've already tried clearing memcache via the memcache viewer, even though this particular query isn't manually memcached)
From the remote console:
> from google.appengine.ext.db import GqlQuery
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
15
The remote console agrees with the app itself, which only seems to be able to see 15 of the expected 101 rows.
What gives?
UPDATE:
I suspect this might be an indexing issue. If I issue get_by_key_name for one of the missing rows, it subsequently shows up in db api queries.
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
15
> entities.Assignment.get_by_key_name('201212-assignment-135.9')
<entities.Assignment object at 0xa11eb6c>
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
16
So should I (or can I) rebuild my indexes to remedy this problem?
UPDATE #2:
I attempted to build a perfect index for this query, and have just verified that even when the query does use the just-built index (via query.index_list()), the results are still only limited to a small subset of items available via the datastore viewer. Infuriatingly, it's actually a different subset than is available with the previous index (20 items vs 15 items). So now adding an additional filter term results in an additional 5 rows returned. So dumb.
All indexes claim to be "serving" so there shouldn't be any reason that the indexes are this far off.
UPDATE #3:
Sometimes, using my new index, I'll get the right answer:
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb') and user = 'zee'").count()
101
However if I issue this query 10 times, it comes back with the 'bad' results about half the time:
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb') and user = 'zee'").count()
16
So maybe it's an issue of a bad/behind bigtable replica that I'm hitting half the time, or something else completely opaque that we won't get an answer to (appengine status does list a service disruption today), but I have a feeling that this will be fixed on its own. Will update again if it does.
FINAL UPDATE:
As I suspected, when I woke up this morning my app (and manual queries) now see a consistent, correct view of the data. Would still love an answer as to why this happened, but until I get that I'm going to chalk it up to internal Google bigtable weirdness.
I filed this issue against appengine to see if I can get an answer from someone in the know.

For HRD applications, this is working as intended. App Engine High Replication Datastore (HRD) stores your data synchronously in multiple datacenters. However, the delay from the time a write is committed until it becomes visible in all datacenters means that queries across multiple entity groups (non-ancestor queries) can only guarantee eventually consistent results. [1]
In your specific case, the discrepancy between the results from your application and the Admin Console Datastore Viewer is just because they most likely are reading from different Datastore servers with different consistency.
If you require a consistent view of your data, I advise taking a closer look into the article "Structuring Data for Strong Consistency"
[1] https://developers.google.com/appengine/docs/java/datastore/structuring_for_strong_consistency
