UPDATED:
I do not have an entry in index.yaml for this query, because I expect the zig-zag merge join to handle pure equality filters (which it appears to do in the datastore UI). Here is the actual query I'm making:
UploadBlock.query(
    UploadBlock.time_slices == timeslice,  # timeslice is each value in the repeated property
    UploadBlock.company == block.company,
    UploadBlock.aggregator_serial == block.aggregator_serial,
    UploadBlock.instrument_serial == block.instrument_serial)\
    .fetch(10, keys_only=True)
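For readers unfamiliar with the zig-zag merge join mentioned above, here is a rough pure-Python sketch of the idea. It is a toy, not the Datastore's implementation: each list stands in for a scan of one single-property index (one per equality filter) that returns entity keys in ascending order, and the function name is made up for illustration.

```python
from bisect import bisect_left


def zigzag_merge_join(*sorted_key_lists):
    """Intersect several ascending key lists by leapfrogging between them."""
    if not sorted_key_lists or any(not keys for keys in sorted_key_lists):
        return []
    positions = [0] * len(sorted_key_lists)
    # No key smaller than the largest "first key" can be in every list.
    candidate = max(keys[0] for keys in sorted_key_lists)
    matches = []
    while True:
        agreed = True
        for i, keys in enumerate(sorted_key_lists):
            # Seek forward to the first key >= candidate in this scan.
            positions[i] = bisect_left(keys, candidate, positions[i])
            if positions[i] == len(keys):
                return matches  # one scan is exhausted, so no more matches
            if keys[positions[i]] != candidate:
                # Overshoot: this scan proposes a new, larger candidate.
                candidate = keys[positions[i]]
                agreed = False
                break
        if agreed:
            matches.append(candidate)
            # Step the first scan past the match to get the next candidate.
            positions[0] += 1
            if positions[0] == len(sorted_key_lists[0]):
                return matches
            candidate = sorted_key_lists[0][positions[0]]


# Hypothetical entity ids matching each of three equality filters.
result = zigzag_merge_join([1, 3, 5, 7], [3, 4, 7, 9], [2, 3, 7])
```

In this model the cost depends on how many seeks are needed rather than on how many results come back, so a criteria set where every filter matches many entities but very few match all of them would do a lot of work for two results. Whether that is what happens for the hanging criteria here is only a guess.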
I've cross-posted this because I didn't read the initial support page first, so this is also on the google forum at https://groups.google.com/forum/#!topic/google-appengine/-7n8Y_tzCgM
I have a datastore query I make as an integral part of my system. It's an equality filter on four properties. When I make this query for one particular set of criteria, the query just hangs. It does not hang for this set of criteria in the datastore viewer, so I can see that there are only two results to return. I can fetch both of these entities by id without problem, but the query always hangs, both in the remote shell and in production. The query runs fine for a multitude of other criteria; it appears to be just this one set that fails. I have tried replacing both entities so that the criteria still match but the entities have different ids, and I get the same result.
For some more specific information, the four properties are defined as follows:
company = ndb.KeyProperty(indexed=True)
aggregator_serial = ndb.StringProperty(indexed=True)
instrument_serial = ndb.StringProperty(indexed=True)
time_slices = ndb.IntegerProperty(indexed=True, repeated=True)
I'm running gcloud versions
Google Cloud SDK 207.0.0
app-engine-java 1.9.64
app-engine-python 1.9.71
app-engine-python-extras 1.9.69
beta 2018.06.22
bq 2.0.34
core 2018.06.22
gcloud
gsutil 4.32
Any help on what I should try for next steps would be greatly appreciated.
Thanks in advance!
Related
Problem
Running a datastore query with or without FetchOptions.Builder.withLimit(100) takes the same execution time! Why is that? Isn't the limit method intended to reduce the time to retrieve results!?
Test setup
I am locally testing the execution time of some datastore queries with Google's App Engine. I am using the Google Cloud SDK Standard Environment with the App Engine SDK 1.9.59.
For the test, I created an example entity with 5 indexed properties and 5 unindexed properties. I filled the datastore with 50,000 entries of this test entity. I run the following method to retrieve 100 of these entities using the withLimit() method.
public List<Long> getTestIds() {
    List<Long> ids = new ArrayList<>();
    FetchOptions fetchOptions = FetchOptions.Builder.withLimit(100);
    Query q = new Query("test_kind").setKeysOnly();
    for (Entity entity : datastore.prepare(q).asIterable(fetchOptions)) {
        ids.add(entity.getKey().getId());
    }
    return ids;
}
I measure the time before and after calling this method:
long start = System.currentTimeMillis();
int size = getTestIds().size();
long end = System.currentTimeMillis();
log.info("time: " + (end - start) + " results: " + size);
I log the execution time and the number of returned results.
Results
When I do not use the withLimit() FetchOptions for the query, I get the expected 50,000 results in about 1740 ms. Nothing surprising here.
If I run the code as displayed above and use withLimit(100) I get the expected 100 results. However, the query runs about the same 1740 ms!
I tested with different numbers of datastore entries and different limits. Every time the queries with or without withLimit(100) took the same time.
Question
Why is the query still fetching all entities? I am sure the query is not supposed to get all entities even though the limit is set to 100 right? What am I missing? Is there some datastore configuration for that? After testing and searching the web for 4 days I still can't find the problem.
FWIW, you shouldn't expect meaningful results from datastore performance tests performed locally, using either the development server or the datastore emulator - they're just emulators, they don't have the same performance (or even the 100% equivalent functionality) as the real datastore.
See for example Datastore fetch VS fetch(keys_only=True) then get_multi (including comments)
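To make the intuition concrete: a limit only saves time if the backend stops scanning early. The pure-Python sketch below contrasts the two behaviours. It is not App Engine code, all names are made up, and whether the local development datastore actually behaves like the "eager" variant is an assumption on my part, not something I have confirmed.

```python
from itertools import islice


def scan(rows, touched):
    """Yield rows one by one, counting how many the 'backend' had to read."""
    for row in rows:
        touched[0] += 1
        yield row


def eager_limit(rows, limit, touched):
    # Materialise every row first, then cut the list down to the limit.
    return list(scan(rows, touched))[:limit]


def lazy_limit(rows, limit, touched):
    # Stop reading as soon as the limit is reached.
    return list(islice(scan(rows, touched), limit))


rows = range(50000)
eager_touched, lazy_touched = [0], [0]
eager_result = eager_limit(rows, 100, eager_touched)
lazy_result = lazy_limit(rows, 100, lazy_touched)
```

Both calls return the same 100 rows, but the eager variant reads all 50,000 rows to get there while the lazy one reads only 100. A local stub that works like the first would show exactly the "same time with or without the limit" symptom.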
I'm flummoxed.
I noticed today that some data I thought should be present in my production appengine app wasn't showing up. I connected to the app via the remote console and ran the queries manually. Sure enough it looked like I only had 15 of the 101 rows I was expecting to see.
Then I went to my admin console at appengine.google.com and fired up the datastore viewer with the following query:
SELECT * FROM Assignment where game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')
The result I see is the first page of 20 results. I page through those results, and am able to see all 101 entities. HOORAY! My data is still there. BUT why then can't I access it via the db api? (NOTE: I've already tried clearing memcache via the memcache viewer, even though this particular query isn't manually memcached)
From the remote console:
> from google.appengine.ext.db import GqlQuery
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
15
The remote console agrees with the app itself, which only seems to be able to see 15 of the expected 101 rows.
What gives?
UPDATE:
I suspect this might be an indexing issue. If I issue get_by_key_name for one of the missing rows, it subsequently shows up in db api queries.
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
15
> entities.Assignment.get_by_key_name('201212-assignment-135.9')
<entities.Assignment object at 0xa11eb6c>
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
16
So should I (or can I) rebuild my indexes to remedy this problem?
UPDATE #2:
I attempted to build a perfect index for this query, and have just verified that even when the query does use the just-built index (via query.index_list()), the results are still only limited to a small subset of items available via the datastore viewer. Infuriatingly, it's actually a different subset than is available with the previous index (20 items vs 15 items). So now adding an additional filter term results in an additional 5 rows returned. So dumb.
All indexes claim to be "serving" so there shouldn't be any reason that the indexes are this far off.
UPDATE #3:
Sometimes, using my new index, I'll get the right answer:
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb') and user = 'zee'").count()
101
However if I issue this query 10 times, it comes back with the 'bad' results about half the time:
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb') and user = 'zee'").count()
16
So maybe it's an issue of a bad/behind bigtable replica that I'm hitting half the time, or something else completely opaque that we won't get an answer to (appengine status does list a service disruption today), but I have a feeling that this will be fixed on its own. Will update again if it does.
FINAL UPDATE:
As I suspected, when I woke up this morning my app (and manual queries) now see a consistent, correct view of the data. Would still love an answer as to why this happened, but until I get that I'm going to chalk it up to internal Google bigtable weirdness.
I filed this issue against appengine to see if I can get an answer from someone in the know.
For HRD applications, this is working as intended. App Engine High Replication Datastore (HRD) stores your data synchronously in multiple datacenters. However, the delay from the time a write is committed until it becomes visible in all datacenters means that queries across multiple entity groups (non-ancestor queries) can only guarantee eventually consistent results. [1]
In your specific case, the discrepancy between the results from your application and the Admin Console Datastore Viewer is just because they most likely are reading from different Datastore servers with different consistency.
If you require a consistent view of your data, I advise taking a closer look into the article "Structuring Data for Strong Consistency"
[1] https://developers.google.com/appengine/docs/java/datastore/structuring_for_strong_consistency
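A toy model may help show why a query can return 15 rows while a get by key still finds a "missing" entity. The class below is purely illustrative and not how HRD is implemented; the names and the idea of a single "lagging replica" are assumptions made for the sketch.

```python
class ToyReplica(object):
    """A replica that has received writes but applies them lazily."""

    def __init__(self):
        self.committed = {}   # key -> entity, durable once the write returns
        self.visible = set()  # keys already applied and visible to queries

    def commit(self, key, entity, applied):
        self.committed[key] = entity
        if applied:
            self.visible.add(key)

    def get(self, key):
        # A lookup by key is strongly consistent in this model:
        # it applies the pending write before returning the entity.
        if key in self.committed:
            self.visible.add(key)
        return self.committed.get(key)

    def query_count(self):
        # A non-ancestor query only sees what has already been applied.
        return len(self.visible)

    def catch_up(self):
        # Eventually every committed write becomes visible.
        self.visible.update(self.committed)


lagging = ToyReplica()
for n in range(101):
    lagging.commit("assignment-%d" % n, {"n": n}, applied=(n < 15))

count_before = lagging.query_count()     # only the applied writes
lagging.get("assignment-50")             # forces one pending write to apply
count_after_get = lagging.query_count()
lagging.catch_up()
count_final = lagging.query_count()
```

In this model the counts go 15, then 16 after the get by key, then 101 once the replica catches up, which mirrors the sequence reported in the question.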
I am trying to use app engine's search API to search locations:
https://developers.google.com/appengine/docs/python/search/overview#Performing_Location-Based_Searches
The problem is that no matter what I do, I get zero results. I set the search lat/lng to the exact point of a document's GeoPoint property and it still returns zero.
I know the regular search is working because if I change the query to be a regular full-text search, it works.
Here is an example of my data (this is actually from the example app here: http://www.youtube.com/watch?v=cE6gb5pqr1k)
Full Text Search > stores1
Document Id: sanjose
Field Name        Field Value
store_address     123 Main St.
store_location    search.GeoPoint(latitude=37.37, longitude=-121.92)
store_name        San Jose
And then my query:
index = search.Index('stores1')
loc = (37.37, -121.92)
query = "distance(store_location, geopoint(37.37, -121.92)) < 4500"
loc_expr = "distance(store_location, geopoint(37.37, -121.92))"
sortexpr = search.SortExpression(
    expression=loc_expr,
    direction=search.SortExpression.ASCENDING, default_value=4501)
search_query = search.Query(
    query_string=query,
    options=search.QueryOptions(
        sort_options=search.SortOptions(expressions=[sortexpr])))
results = index.search(search_query)
print results
And it returns:
search.SearchResults(number_found=0L)
Am I missing something or doing something wrong? This should return at least that one result, right?
** UPDATE **
After doing some prying/searching/testing I think this may be a bug regarding the google app engine development server.
If I run location searches on the same data in the production environment, I get expected results. When I compare and run the exact same query on the data in the development environment, I get the unexpected 0 results.
If anybody has any insight on this, please advise. Otherwise, for those of you seeing the same problem, I created an issue on app engine's issue tracker here.
You've probably already figured this out, but in case someone comes across this post, the geosearch feature of AppEngine's Search API returns zero results on the dev server. From https://developers.google.com/appengine/training/fts_intro/lesson2:
"...some search queries are not fully supported on the Development Web Server (running locally), so you’ll need to run them using a deployed application."
Here's another useful link:
https://developers.google.com/appengine/docs/python/search/devserver
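If you still want to sanity-check locally which documents a distance query ought to match, one option is to compute the distances yourself. This is a stand-alone sketch using the haversine formula, not the Search API; the production service may compute distance differently, and the second store below is invented for the example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(stores, center, radius_m):
    """Return store names closer than radius_m to center, nearest first."""
    lat, lng = center
    hits = [(haversine_m(lat, lng, s_lat, s_lng), name)
            for name, (s_lat, s_lng) in stores.items()]
    return [name for dist, name in sorted(hits) if dist < radius_m]


stores = {
    "sanjose": (37.37, -121.92),           # from the question
    "sanfrancisco": (37.7749, -122.4194),  # hypothetical second document
}
nearby = within_radius(stores, (37.37, -121.92), 4500)
```

With the query point sitting exactly on the San Jose store, only that store falls inside the 4500 m radius, which is the result the deployed app should return for the query in the question.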
class MyEntity(db.Model):
    timestamp = db.DateTimeProperty()
    title = db.StringProperty()
    number = db.FloatProperty()
db.GqlQuery("SELECT * FROM MyEntity WHERE title = 'mystring' AND timestamp >= date('2012-01-01') AND timestamp <= date('2012-12-31') ORDER BY timestamp DESC").fetch(1000)
This should fetch ~600 entities on app engine. On my dev server it behaves as expected, builds the index.yaml, I upload it, test on server but on app engine it does not return anything.
Index:
- kind: MyEntity
  properties:
  - name: title
  - name: timestamp
    direction: desc
I tried breaking the query down in the datastore viewer to see where the issue is, and the timestamp constraints work as expected. The query returns nothing for WHERE title = 'mystring' when it should be returning a bunch of entities.
I vaguely remember fussy filtering where you had to call .filter("prop =",propValue) with the space between property and operator, but this is a GqlQuery so it's not that (and I tried that format with the GQL too).
Anyone know what my issue is?
One thing I can think of:
I added the list of MyEntity entities into the app via BulkLoader.py prior to the new index being created on my devserver & uploaded. Would that make a difference?
The last line you wrote is probably the problem.
Your entities in the actual real datastore are missing the index required for the query.
As far as I know, when you add a new index, App Engine is supposed to rebuild your indexes for you. This may take some time. You can check your admin page to check the state of your indexes and see if it's still building.
Turns out there's a slight bug in the bulkloader supplied with App Engine SDK - basically autogenerated config transforms strings as db.Text, which is no good if you want these fields indexed. The correct import_transform directive should be:
transform.none_if_empty(str)
This will instruct App Engine to index the uploaded field as a db.StringProperty().
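For anyone wondering what that transform does, here is a small stand-in written from my understanding of the helper's behaviour. It is an approximation for illustration, not the SDK source, and the `_sketch` name is deliberately different from the real function.

```python
def none_if_empty_sketch(fn):
    """Wrap fn so empty input becomes None instead of being converted."""
    def wrapper(value):
        if value == '' or value is None:
            return None
        return fn(value)
    return wrapper


# Equivalent in spirit to transform.none_if_empty(str) in bulkloader.yaml.
to_indexed_string = none_if_empty_sketch(str)

converted = to_indexed_string("mystring")
skipped = to_indexed_string("")
```

The important part for the original problem is the `str` conversion: the value is stored as a plain string rather than as db.Text, and text properties are not indexed, so an equality filter on them can never match.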
Is GQL easy to learn for someone who knows SQL? How is Django/Python? Does App Engine really make scaling easy? Is there any built-in protection against "GQL Injections"? And so on...
I'd love to hear the not-so-obvious ups and downs of using app engine.
Cheers!
My experience with google app engine has been great, and the 1000 result limit has been removed, here is a link to the release notes:
app-engine release notes
No more 1000 result limit - That's right: with addition of Cursors and the culmination of many smaller Datastore stability and performance improvements over the last few months, we're now confident enough to remove the maximum result limit altogether. Whether you're doing a fetch, iterating, or using a Cursor, there's no limits on the number of results.
The most glaring and frustrating issue is the datastore api, which looks great and is very well thought out and easy to work with if you are used to SQL, but has a 1000 row limit across all query resultsets, and you can't access counts or offsets beyond that. I've run into weirder issues, with not actually being able to add or access data for a model once it goes beyond 1000 rows.
See the Stack Overflow discussion about the 1000 row limit
Aral Balkan wrote a really good summary of this and other problems
Having said that, app engine is a really great tool to have at one's disposal, and I really enjoy working with it. It's perfect for deploying micro web services (e.g. JSON APIs) to use in other apps.
GQL is extremely simple - it's a subset of the SQL 'SELECT' statement, nothing more. It's only a convenience layer over the top of the lower-level APIs, though, and all the parsing is done in Python.
Instead, I recommend using the Query API, which is procedural, requires no run-time parsing, and makes 'GQL injection' vulnerabilities totally impossible (though they are impossible in properly written GQL anyway). The Query API is very simple: call .all() on a Model class, or call db.Query(modelname). The Query object has .filter(field_and_operator, value), .order(field_and_direction) and .ancestor(entity) methods, in addition to all the facilities GQL objects have (.get(), .fetch(), .count(), etc.). Each of the Query methods returns the Query object itself for convenience, so you can chain them:
results = MyModel.all().filter("foo =", 5).order("-bar").fetch(10)
Is equivalent to:
results = MyModel.gql("WHERE foo = 5 ORDER BY bar DESC LIMIT 10").fetch()
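To show why chaining works and why values passed to .filter() cannot be "injected", here is a toy in-memory imitation of that interface. ToyQuery is a made-up class for illustration only; it supports just the `=` operator and is in no way the real db.Query.

```python
class ToyQuery(object):
    """Minimal chainable query over a list of dicts."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._filters = []
        self._order = None

    def filter(self, field_and_operator, value):
        field, operator = field_and_operator.split()
        if operator != "=":
            raise ValueError("this sketch only supports '='")
        # The value is stored as data; it is never parsed as query text.
        self._filters.append((field, value))
        return self  # returning self is what makes chaining possible

    def order(self, field_and_direction):
        self._order = field_and_direction
        return self

    def fetch(self, limit):
        rows = [r for r in self._rows
                if all(r[f] == v for f, v in self._filters)]
        if self._order:
            descending = self._order.startswith("-")
            field = self._order.lstrip("-")
            rows.sort(key=lambda r: r[field], reverse=descending)
        return rows[:limit]


rows = [{"foo": 5, "bar": 1}, {"foo": 5, "bar": 3}, {"foo": 6, "bar": 2}]
results = ToyQuery(rows).filter("foo =", 5).order("-bar").fetch(10)
injected = ToyQuery(rows).filter("foo =", "5 OR 1=1").fetch(10)
```

The second call is the interesting one: the hostile string is compared as a literal value, matches nothing, and returns an empty list, because there is no query text for it to become part of.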
A major downside when working with AppEngine was the 1k query limit, which has been mentioned in the comments already. What I haven't seen mentioned though is the fact that there is a built-in sortable order, with which you can work around this issue.
From the appengine cookbook:
def deepFetch(queryGen, key=None, batchSize=100):
    """Iterator that yields an entity in batches.

    Args:
        queryGen: should return a Query object
        key: used to .filter() for __key__
        batchSize: how many entities to retrieve in one datastore call

    Retrieved from http://tinyurl.com/d887ll (AppEngine cookbook).
    """
    from google.appengine.ext import db

    # AppEngine will not fetch more than 1000 results
    batchSize = min(batchSize, 1000)

    query = None
    done = False
    count = 0

    if key:
        key = db.Key(key)

    while not done:
        print count
        query = queryGen()
        if key:
            query.filter("__key__ > ", key)
        results = query.fetch(batchSize)
        for result in results:
            count += 1
            yield result
        if batchSize > len(results):
            done = True
        else:
            key = results[-1].key()
The above code together with Remote API (see this article) allows you to retrieve as many entities as you need.
You can use the above code like this:
def allMyModel():
    # queryGen must return a fresh Query object on every call
    return MyModel.all()

myModels = deepFetch(allMyModel)
At first I had the same experience as others who transitioned from SQL to GQL -- kind of weird to not be able to do JOINs, count more than 1000 rows, etc. Now that I've worked with it for a few months I absolutely love the app engine. I'm porting all of my old projects onto it.
I use it to host several high-traffic web applications (at peak time one of them gets 50k hits a minute.)
Google App Engine doesn't use an actual database, and apparently uses some sort of distributed hash map. This will lend itself to some different behaviors that people who are accustomed to SQL just aren't going to see at first. So for example getting a COUNT of items in regular SQL is expected to be a fast operation, but with GQL it's just not going to work the same way.
Here are some more issues:
http://blog.burnayev.com/2008/04/gql-limitations.html
In my personal experience, it's an adjustment, but the learning curve is fine.