When executing a GQL query in Datastore that includes an "IN ARRAY(..)" clause, query execution does not always complete correctly: it seems to depend on the specific entities being fetched.
More precisely, the exception is thrown by the Java version of the Datastore API when executing the hasNext() method on the QueryResults object, i.e. when reading all returned entities.
For example, it can return the first entity and then throw an exception on the second invocation of hasNext(). The behavior is repeatable, always with the same entities.
Example of query execution:
String sql = "SELECT * FROM myentity WHERE myfield IN ARRAY('abc','efg') and some other conditions LIMIT 10001 OFFSET 0";
GqlQuery.Builder builder = GqlQuery.newGqlQueryBuilder(sql);
builder.setAllowLiteral(true);
GqlQuery query = builder.build();
QueryResults response = datastore.run(query);
while (response.hasNext()) { // <- the exception is thrown here
    Entity entity = (Entity) response.next();
    ...
}
The exception thrown is:
Can't get the number of an unknown enum value.
java.lang.IllegalArgumentException: Can't get the number of an unknown enum value.
at com.google.datastore.v1.PropertyFilter$Operator.getNumber(PropertyFilter.java:233)
at com.google.datastore.v1.PropertyFilter$Builder.setOp(PropertyFilter.java:929)
at com.google.cloud.datastore.StructuredQuery$PropertyFilter.toPb(StructuredQuery.java:519)
at com.google.cloud.datastore.StructuredQuery$CompositeFilter.toPb(StructuredQuery.java:237)
at com.google.cloud.datastore.StructuredQuery.toPb(StructuredQuery.java:1011)
at com.google.cloud.datastore.StructuredQuery.populatePb(StructuredQuery.java:975)
at com.google.cloud.datastore.QueryResultsImpl.sendRequest(QueryResultsImpl.java:71)
at com.google.cloud.datastore.QueryResultsImpl.computeNext(QueryResultsImpl.java:93)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
It also happens with the December 2022 version of the Google Java Datastore API, so it does not seem to be tied to a specific library version.
How can I run this kind of query without hitting the exception?
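If upgrading the client library does not make the IN operator serializable, one possible workaround (only a sketch, using the same com.google.cloud.datastore classes as the snippet above) is to avoid IN ARRAY entirely and run one equality query per value, merging the results client-side:
// Sketch: replace `myfield IN ARRAY('abc','efg')` with one equality query per value.
// Assumes `datastore` is the same com.google.cloud.datastore.Datastore instance as above;
// the query's other conditions would need to be added via StructuredQuery.CompositeFilter.and().
List<String> values = Arrays.asList("abc", "efg");
Map<Key, Entity> merged = new LinkedHashMap<>();
for (String value : values) {
    EntityQuery perValueQuery = Query.newEntityQueryBuilder()
            .setKind("myentity")
            .setFilter(StructuredQuery.PropertyFilter.eq("myfield", value))
            .setLimit(10001)
            .build();
    QueryResults<Entity> results = datastore.run(perValueQuery);
    while (results.hasNext()) {
        Entity entity = results.next();
        merged.put(entity.getKey(), entity); // de-duplicate by key
    }
}
Note that the original LIMIT now applies per value rather than to the union, so you may need to trim `merged` to the desired size afterwards.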
Related
Problem
Running a datastore query with or without FetchOptions.Builder.withLimit(100) takes the same execution time! Why is that? Isn't the limit method intended to reduce the time needed to retrieve results?
Test setup
I am locally testing the execution time of some datastore queries with Google's App Engine. I am using the Google Cloud SDK Standard Environment with the App Engine SDK 1.9.59.
For the test, I created an example entity with 5 indexed and 5 unindexed properties and filled the datastore with 50,000 entities of this test kind. I run the following method to retrieve 100 of these entities using withLimit().
public List<Long> getTestIds() {
    List<Long> ids = new ArrayList<>();
    FetchOptions fetchOptions = FetchOptions.Builder.withLimit(100);
    Query q = new Query("test_kind").setKeysOnly();
    for (Entity entity : datastore.prepare(q).asIterable(fetchOptions)) {
        ids.add(entity.getKey().getId());
    }
    return ids;
}
I measure the time before and after calling this method:
long start = System.currentTimeMillis();
int size = getTestIds().size();
long end = System.currentTimeMillis();
log.info("time: " + (end - start) + " results: " + size);
I log the execution time and the number of returned results.
Results
When I do not use the withLimit() FetchOptions for the query, I get the expected 50,000 results in about 1740 ms. Nothing surprising here.
If I run the code as displayed above with withLimit(100), I get the expected 100 results. However, the query still takes about the same 1740 ms!
I tested with different numbers of datastore entries and different limits. Every time, the queries with and without withLimit(100) took the same time.
Question
Why is the query still fetching all entities? Surely the query is not supposed to fetch all entities when the limit is set to 100, right? What am I missing? Is there some datastore configuration for that? After testing and searching the web for 4 days, I still can't find the problem.
FWIW, you shouldn't expect meaningful results from datastore performance tests performed locally, using either the development server or the datastore emulator. They're just emulators; they don't have the same performance (or even fully equivalent functionality) as the real datastore.
See for example Datastore fetch VS fetch(keys_only=True) then get_multi (including comments)
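For completeness, a minimal variant of the test that could be deployed and timed against the real datastore (a sketch only; the method name is illustrative and the same DatastoreService setup as in the question is assumed) would be:
public List<Long> getTestIdsAsList(DatastoreService datastore) {
    // withLimit(100) is applied by the datastore backend, so against the real
    // service this returns (and times) at most 100 keys-only results.
    FetchOptions fetchOptions = FetchOptions.Builder.withLimit(100);
    Query q = new Query("test_kind").setKeysOnly();
    List<Long> ids = new ArrayList<>();
    for (Entity entity : datastore.prepare(q).asList(fetchOptions)) {
        ids.add(entity.getKey().getId());
    }
    return ids;
}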
I've created two MapReduce pipelines for uploading CSV files to create Categories and Products in bulk. Each product gets tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation, I would think they'd be automatically cached in Memcache when retrieved from the Datastore.
I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.
However, it doesn't seem like the Product upload is using Memcache to get the Categories. When I check the Memcache viewer in the portal, it says something along the lines of the hit count being around 180 and the miss count around 60. If I was uploading 3000 products and retrieving the category each time, shouldn't I have around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).
Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:
def product_bulk_import_map(data):
    """Product Bulk Import map function."""
    result = {"status": "CREATED"}
    product_data = data
    try:
        # parse input parameter tuple
        byteoffset, line_data = data
        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data
        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)
        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id=p_id,
                category=category.key,
                product_type=p_type,
                description=p_description
            )
        product.put()
        result["entity"] = product.to_dict()
    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)
    # return results
    yield (str(product_data), result)
MapReduce intentionally disables memcache for NDB.
See mapreduce/util.py line 373, _set_ndb_cache_policy() (as of 2015-05-01):
def _set_ndb_cache_policy():
    """Tell NDB to never cache anything in memcache or in-process.
    This ensures that entities fetched from Datastore input_readers via NDB
    will not bloat up the request memory size and Datastore Puts will avoid
    doing calls to memcache. Without this you get soft memory limit exits,
    which hurts overall throughput.
    """
    ndb_ctx = ndb.get_context()
    ndb_ctx.set_cache_policy(lambda key: False)
    ndb_ctx.set_memcache_policy(lambda key: False)
You can force get_by_id() and put() to use memcache, e.g.:
product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)
Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However, I don't know enough to say whether this has other undesired effects:
ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)
As for the docstring about "soft memory limit exits", I don't understand why that would occur if only memcache were enabled (i.e. no in-context cache).
It actually seems like you want memcache to be enabled for puts, otherwise your app ends up reading stale data from NDB's memcache after your mapper has modified the data underneath.
As Slawek Rewaj already mentioned, this is caused by the in-context cache. When retrieving an entity, NDB tries the in-context cache first, then memcache, and finally retrieves the entity from the datastore if it wasn't found in either the in-context cache or memcache. The in-context cache is just a Python dictionary, and its lifetime and visibility are limited to the current request, but MapReduce makes multiple calls to product_bulk_import_map() within a single request.
You can find more information about the in-context cache here: https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
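As a small illustration of that behaviour (a sketch using a stand-in Category model, not MapReduce code, and assuming NDB's default cache policies), repeated lookups of the same key within one request are answered by the in-context cache, so neither memcache nor the datastore sees them:
from google.appengine.ext import ndb

class Category(ndb.Model):
    name = ndb.StringProperty()

def handle_request():
    first = Category.get_by_id('books')   # first call goes to memcache/datastore
    second = Category.get_by_id('books')  # answered by the in-context cache
    assert first is second                # same Python object, no extra RPC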
The index.yaml file of my GAE app is no longer updated by the development server.
I have recently added a new kind to my app and a handler that queries this kind like so:
from google.appengine.ext import ndb
class MyKind(ndb.Model):
    thing = ndb.TextProperty()
    timestamp = ndb.DateTimeProperty(auto_now_add=True)
and in the handler I have a query
query = MyKind.query()
query.order(-MyKind.timestamp)
logging.info(query.iter().index_list())
entities = query.fetch(100)
for entity in entities:
    # do something
AFAIK, the development server should create an index for this query and update index.yaml accordingly. However, it doesn't. It just looks like this:
indexes:
# AUTOGENERATED
The logging.info(query.iter().index_list()) call should output the index used for the query, but it just says 'None'. Also, the SDK console says 'Datastore contains no indexes.'
Running the query returns the entities unsorted. I have two questions:
is there some syntax error in my code that causes the query results to be unsorted, or is it the missing index?
if it's the missing index, is there a way to manually force the dev server to update index.yaml? Other suggestions?
Thank you
Your call to order() returns a new query:
query = MyKind.query()
query = query.order(-MyKind.timestamp)
To clarify: query.order(-MyKind.timestamp) does not modify the query in place; it returns a new query, so you need to use the query returned by that method. As written, query.order(-MyKind.timestamp) in your code does nothing.
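Applied to the handler from the question, the corrected snippet would look something like this:
query = MyKind.query().order(-MyKind.timestamp)
entities = query.fetch(100)
for entity in entities:
    pass  # entities now come back sorted by timestamp, newest first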
class MyEntity(db.Model):
    timestamp = db.DateTimeProperty()
    title = db.StringProperty()
    number = db.FloatProperty()
db.GqlQuery("SELECT * FROM MyEntity WHERE title = 'mystring' AND timestamp >= date('2012-01-01') AND timestamp <= date('2012-12-31') ORDER BY timestamp DESC").fetch(1000)
This should fetch ~600 entities on App Engine. On my dev server it behaves as expected and builds the index.yaml; I upload it and test on the server, but on App Engine the query does not return anything.
Index:
- kind: MyEntity
  properties:
  - name: title
  - name: timestamp
    direction: desc
I tried splitting the query up in the Datastore viewer to see where the issue is, and the timestamp constraints work as expected. The query returns nothing for WHERE title = 'mystring' when it should be returning a bunch of entities.
I vaguely remember fussy filtering where you had to call .filter("prop =",propValue) with the space between property and operator, but this is a GqlQuery so it's not that (and I tried that format with the GQL too).
Anyone know what my issue is?
One thing I can think of:
I added the list of MyEntity entities into the app via BulkLoader.py prior to the new index being created on my devserver & uploaded. Would that make a difference?
The last line you wrote is probably the problem.
Your entities in the actual datastore are missing the index entries required for the query.
As far as I know, when you add a new index, App Engine is supposed to rebuild your indexes for you. This may take some time. You can check your admin page to check the state of your indexes and see if it's still building.
Turns out there's a slight bug in the bulkloader supplied with the App Engine SDK: the autogenerated config imports strings as db.Text, which is no good if you want these fields indexed. The correct import_transform directive should be:
transform.none_if_empty(str)
This will instruct App Engine to index the uploaded field as a db.StringProperty().
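For reference, the corresponding property_map entry in bulkloader.yaml would look roughly like this (a sketch; only the title property is shown, and the surrounding config is assumed to match your autogenerated file):
transformers:
- kind: MyEntity
  connector: csv
  property_map:
    - property: title
      external_name: title
      import_transform: transform.none_if_empty(str)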
I have a simple question.
In the Objectify documentation it says that "Only get(), put(), and delete() interact with the cache. query() is not cached":
http://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Global_Cache.
What I'm wondering is: if you have one root entity (I did not use @Parent due to all the scalability issues it seems to have) that all the other entities have a Key to, and you do a query such as
ofy.query(ChildEntity.class).filter("rootEntity", rootEntity).list()
is this completely bypassing the cache?
If this is the case, is there an efficient way to cache a query on conditions? Or, for that matter, can you cache a query with a parent where you would have to make an actual ancestor query like the following:
Key<Parent> rootKey = ObjectifyService.factory().getKey(root);
ofy.query(ChildEntity.class).ancestor(rootKey)
Thank you
As to one of the comments below, I've added an edit.
Sample DAO (ignore the validate method; it just does some null and quantity checks):
This is a sample find-all method inside a delegate, called from the DAO that the RequestFactory ServiceLocator is using:
public List<EquipmentCheckin> findAll(Subject subject, Objectify ofy, Event event) {
    final Business business = (Business) subject.getSession().getAttribute(BUSINESS_ATTRIBUTE);
    final List<EquipmentCheckin> checkins = ofy.query(EquipmentCheckin.class)
            .filter(BUSINESS_ATTRIBUTE, business)
            .filter(EVENT_CONDITION, event)
            .list();
    return validate(ofy, checkins);
}
Now, when this is executed, I find that the following method is actually being called in my AbstractDAO:
/**
 * @param id the entity id
 * @return the entity with the given id, or null if none is found
 */
public T find(Long id) {
    System.out.println("finding " + clazz.getSimpleName() + " id = " + id);
    return ObjectifyService.begin().find(clazz, id);
}
Yes, all queries bypass Objectify's integrated memcache and fetch results directly from the datastore. The datastore provides the (increasingly sophisticated) query engine that understands how to return results; determining cache invalidation for query results is pretty much impossible from the client side.
On the other hand, Objectify4 does offer a hybrid query cache whereby queries are automagically converted to a keys-only query followed by a batch get. The keys-only query still requires the datastore, but any entity instances are pulled from (and, on a miss, populate) memcache. It might save you money.
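If you are not on Objectify4 yet, you can approximate that hybrid behaviour by hand: run a keys-only query and then a batch get, which does go through the global cache for entity classes annotated with @Cached. A rough sketch against the Objectify 3 style API used in the question (treat the exact method names as an assumption):
// The keys-only query still hits the datastore, but the batch get() can be
// served from memcache for @Cached entities; only misses fall through.
List<Key<ChildEntity>> keys = ofy.query(ChildEntity.class)
        .filter("rootEntity", rootEntity)
        .listKeys();
Map<Key<ChildEntity>, ChildEntity> children = ofy.get(keys);
You trade an extra round trip for the batch get, but cache hits avoid re-reading full entities from the datastore.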