I am using GAE for my server, with all my entities in Datastore. One of the entity kinds has more than 2,000 records, and it takes almost 30 seconds to read them all. So I wanted to use a cache to improve performance.
I have tried Objectify's @Cache annotation on the Datastore entity, but I can't find how to read from the stored cache. I have declared the entity as below:
@Entity
@Cache
public class Devices {
}
The second thing I tried is Memcache. I am storing the whole List&lt;Devices&gt; under a key, but it is not being stored: I can't see it in the Memcache console, yet no errors or exceptions are thrown while storing the objects.
putvalue("temp", List<Devices>)
public void putValue(String key, Object value) {
    Cache cache = getCache();
    logger.info(TAG + "getCache() :: storing memcache for key : " + key);
    try {
        if (cache != null) {
            cache.put(key, value);
        }
    } catch (Exception e) {
        logger.info(TAG + "getCache() :: exception : " + e);
    }
}
When I try to retrieve the value using getValue("temp"), it returns null or empty:
Object object = cache.get(key);
My main objective is to get all the records of the entity in under 5 seconds. Can anyone suggest what I am doing wrong here, or a better way to retrieve the records quickly from Datastore?
Datastore Objectify actually uses the App Engine Memcache service to cache your entity data globally when you use the @Cache annotation. However, as explained in the doc here, only get-by-key, save(), and delete() interact with the cache. Query operations are not cached.
Regarding the App Engine Memcache approach, you may be hitting the limit on the maximum size of a cached data value, which is 1 MiB, although I believe this would indeed raise an exception.
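To rule out silent failures, you could also try the low-level Memcache API directly instead of the JCache-style Cache helper shown in the question. A rough sketch (the "devices" key and the devicesList variable are purely illustrative):

// com.google.appengine.api.memcache
MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

// The serialized value must stay below the ~1 MiB per-entry limit
boolean stored = memcache.put("devices", new ArrayList<>(devicesList),
        Expiration.byDeltaSeconds(3600),           // optional TTL
        MemcacheService.SetPolicy.SET_ALWAYS);     // returns false if the value was not stored

@SuppressWarnings("unchecked")
List<Devices> cached = (List<Devices>) memcache.get("devices");
if (cached == null) {
    // cache miss (expired, evicted, or never stored): reload from Datastore and repopulate
}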
Regarding the query itself, you may be better off using a keys-only query and then doing a batch get on the returned keys. That way, Memcache will be used for each record.
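A minimal sketch of that pattern with Objectify, assuming the Devices entity above is registered and annotated with @Cache (ofy() is the usual static import):

// Keys-only query: only keys are read here, no entity data
List<Key<Devices>> keys = ofy().load().type(Devices.class).keys().list();

// Batch get-by-key: @Cache entities are served from Memcache, misses fall back to Datastore
Collection<Devices> devices = ofy().load().keys(keys).values();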
Related
I currently have an application running in the Google App Engine Standard Environment which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, each of my WeatherData entities stores the relevant fields that I want to graph, in addition to a very large JSON string containing forecast data that I do not need for this graphing application. So the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast JSON string for graphing, only a few other fields stored in the entity. The other approach I tried was to fetch only a couple of entities at a time and split the query into multiple API calls, but it took so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
    wDataCsv += '\\n' + ','.join(d)
# Children entity - log of weather at the parent location
class WeatherData(ndb.Model):
    # model for data to save
    ...

    # Function for querying data below a given ancestor between two optional times
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        return cls.query(cls.time >= start, cls.time <= end, ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
cursor = None
while True:
    gc.collect()
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
            wDataCsv += '\\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break
where FETCHNUM = 500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be that Python's garbage collector is not freeing the already-fetched entities that are still referenced, but even when I include gc.collect() I see no improvement.
EDIT:
Following the advice below, I fixed the problem using projection queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely, querying all properties of the entity except the JSON string. While this is not ideal, since it still pulls gratuitous information from the database each time, creating an individual projection for each specific query is not scalable due to the exponential growth of the necessary indexes. For this application, since each additional property adds negligible memory (aside from that JSON string), it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time), and use cursors. Check out the examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally, you could try to create and store just portions of the graphs and stitch them together at the end (only if it comes down to that; I'm not sure exactly how it would be done, and I imagine it wouldn't be trivial).
I want to test if an object exists in the datastore. I know its key. I am doing this right now by loading the entire object:
public boolean doesObjectExist(String knownFooId) {
    Key<Foo> key = Key.create(Foo.class, knownFooId);
    Foo foo = ofy().load().key(key).now();
    if (foo != null) {
        // yes it exists.
        return true;
    }
    return false;
}
That must cost 1 read operation from the datastore. Is there a cheaper way to do it without having to load the entire object? In other words, a way that would only cost 1 "small" operation?
Thanks
There's no way to do it cheaper.
Even if you just do a keys-only query, the query is 1 Read operation + 1 Small operation per key fetched. (https://cloud.google.com/appengine/pricing#costs-for-datastore-calls)
Keep doing a get by key, which is just 1 Read.
public boolean doesObjectExist(String knownFooId) {
    Key<Foo> fooKey = Key.create(Foo.class, knownFooId);
    Key<Foo> datastoreKey = ofy().load().type(Foo.class).filterKey(fooKey).keys().first().now();
    // first() yields null when nothing matches, so compare from the known key to avoid an NPE
    return fooKey.equals(datastoreKey);
}
From the documentation:
QueryKeys keys()
Switches to a keys-only query. Keys-only responses are billed as "minor datastore operations" which are faster and free compared to fetching whole entities.
You could try to fetch just the key; as far as I understand, it would only be a small operation.
// You can query for just keys, which will return Key objects much more efficiently than fetching whole objects
// (filter on the key itself, since the @Id field is not a regular indexed property)
Iterable<Key<Foo>> allKeys = ofy().load().type(Foo.class).filterKey(Key.create(Foo.class, knownFooId)).keys();
It should work. Also take a look at the Objectify docs: https://github.com/objectify/objectify/wiki/Queries
I've created two MapReduce pipelines for uploading CSV files to create Categories and Products in bulk. Each product gets tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation, I would think they'd be automatically cached in Memcache when retrieved from the Datastore.
I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.
However, it doesn't seem like the product upload is using Memcache to get the categories. When I check the Memcache viewer in the portal, it says something along the lines of the hit count being around 180 and the miss count around 60. If I were uploading 3000 products and retrieving the category each time, shouldn't I have around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).
Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:
def product_bulk_import_map(data):
    """Product Bulk Import map function."""
    result = {"status" : "CREATED"}
    product_data = data

    try:
        # parse input parameter tuple
        byteoffset, line_data = data

        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data

        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)

        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id = p_id,
                category = category.key,
                product_type = p_type,
                description = p_description
            )
        product.put()
        result["entity"] = product.to_dict()

    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)

    # return results
    yield (str(product_data), result)
MapReduce intentionally disables memcache for NDB.
See mapreduce/util.py ln 373, _set_ndb_cache_policy() (as of 2015-05-01):
def _set_ndb_cache_policy():
    """Tell NDB to never cache anything in memcache or in-process.

    This ensures that entities fetched from Datastore input_readers via NDB
    will not bloat up the request memory size and Datastore Puts will avoid
    doing calls to memcache. Without this you get soft memory limit exits,
    which hurts overall throughput.
    """
    ndb_ctx = ndb.get_context()
    ndb_ctx.set_cache_policy(lambda key: False)
    ndb_ctx.set_memcache_policy(lambda key: False)
You can force get_by_id() and put() to use memcache, e.g.:
product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)
Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However I don't know enough to say whether this has other undesired effects:
ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)
As for the docstring about "soft memory limit exits", I don't understand why that would occur if only memcache was enabled (ie. no in-context cache).
It actually seems like you want memcache to be enabled for puts, otherwise your app ends up reading stale data from NDB's memcache after your mapper has modified the data underneath.
As Slawek Rewaj already mentioned, this is caused by the in-context cache. When retrieving an entity, NDB tries the in-context cache first, then memcache, and finally it retrieves the entity from the datastore if it wasn't found in either the in-context cache or memcache. The in-context cache is just a Python dictionary, and its lifetime and visibility are limited to the current request, but MapReduce makes multiple calls to product_bulk_import_map() within a single request.
You can find more information about the in-context cache here: https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
I have a simple question.
In the Objectify documentation it says that "Only get(), put(), and delete() interact with the cache. query() is not cached"
http://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Global_Cache.
What I'm wondering is: if you have one root entity (I did not use @Parent due to all the scalability issues it seems to have) that all the other entities have a Key to, and you do a query such as
ofy.query(ChildEntity.class).filter("rootEntity", rootEntity).list()
is this completely bypassing the cache?
If this is the case, is there an efficient way to cache such a query on conditions? Or, for that matter, can you cache a query with a parent, where you would have to make an actual ancestor query like the following:
Key<Parent> rootKey = ObjectifyService.factory().getKey(root);
ofy.query(ChildEntity.class).ancestor(rootKey)
Thank you
As to one of the comments below, I've added an edit.
Sample DAO (ignore the validate method; it just does some null and quantity checks):
This is a sample findAll method inside a delegate called from the DAO that the RequestFactory ServiceLocator is using:
public List<EquipmentCheckin> findAll(Subject subject, Objectify ofy, Event event) {
    final Business business = (Business) subject.getSession().getAttribute(BUSINESS_ATTRIBUTE);
    final List<EquipmentCheckin> checkins = ofy.query(EquipmentCheckin.class)
            .filter(BUSINESS_ATTRIBUTE, business)
            .filter(EVENT_CONDITION, event)
            .list();
    return validate(ofy, checkins);
}
Now, when this is executed, I find that the following method is actually being called in my AbstractDAO:
/**
 * @param id
 * @return
 */
public T find(Long id) {
    System.out.println("finding " + clazz.getSimpleName() + " id = " + id);
    return ObjectifyService.begin().find(clazz, id);
}
Yes, all queries bypass Objectify's integrated memcache and fetch results directly from the datastore. The datastore provides the (increasingly sophisticated) query engine that understands how to return results; determining cache invalidation for query results is pretty much impossible from the client side.
On the other hand, Objectify4 does offer a hybrid query cache, whereby queries are automagically converted to a keys-only query followed by a batch get. The keys-only query still requires the datastore, but any entity instances are pulled from memcache (and populate it on a miss). It might save you money.
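For instance, if your Objectify 4 build exposes Query.hybrid(), something along these lines (reusing the ChildEntity/rootEntity names from the question) would fetch only keys from the datastore and resolve the entities through the global cache:

List<ChildEntity> children = ofy().load()
        .type(ChildEntity.class)
        .filter("rootEntity", rootEntity)
        .hybrid(true)   // keys-only query, then a batch get that goes through memcache
        .list();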
I have the following entity (non-relevant fields/methods are removed):
public class HitsStatsTotalDO
{
    @Id
    transient private Long targetId;

    public Key<HitsStatsTotalDO> createKey()
    {
        return new Key<HitsStatsTotalDO>(HitsStatsTotalDO.class, targetId);
    }
}
So... I'm trying to do a batch get of 10 objects, for which I construct keys using HitsStatsTotalDO.createKey(). I'm attempting to fetch them in a transaction like this:
final List<Key<HitsStatsTotalDO>> keys = ....
// This is being called in transaction..
Map<Key<HitsStatsTotalDO>, HitsStatsTotalDO> result = DAOBase.ofy().get(keys);
which throws the following exception:
java.lang.IllegalArgumentException: operating on too many entity groups in a single transaction.
Could you please elaborate on how many is too many and how to fix it? I couldn't find the exact number in the documentation.
Thanks!
The issue is not the number of entities you're retrieving, it's the fact that they're in multiple entity groups. Either do the fetch outside a transaction, or use an XG (Cross Group) transaction.
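A rough sketch of opting in with the low-level datastore API (with the Objectify 3 style used in the question you manage the transaction yourself; Objectify 4's ofy().transact() reportedly enables XG by default):

// Cross-group transactions span multiple entity groups, but only a handful
// (5 at the time, later raised to 25), so very large batches still won't fit.
TransactionOptions options = TransactionOptions.Builder.withXG(true);
Transaction txn = DatastoreServiceFactory.getDatastoreService().beginTransaction(options);
try {
    // ... perform the batch get / puts inside the transaction ...
    txn.commit();
} finally {
    if (txn.isActive()) {
        txn.rollback();
    }
}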
In a single transaction you can only operate on entities in the same entity group.
What Can Be Done In a Transaction