I have a simple question.
In the Objectify documentation it says that "Only get(), put(), and delete() interact with the cache. query() is not cached":
http://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Global_Cache.
What I'm wondering: if you have one root entity (I did not use @Parent due to all the scalability issues it seems to have) that all the other entities hold a Key to, and you do a query such as
ofy.query(ChildEntity.class).filter("rootEntity", rootEntity).list()
is this completely bypassing the cache?
If this is the case, is there an efficient way to cache a query on conditions? And for that matter, can you cache a query with a parent, where you would have to make an actual ancestor query like the following?
Key<Parent> rootKey = ObjectifyService.factory().getKey(root);
ofy.query(ChildEntity.class).ancestor(rootKey)
Thank you
In response to one of the comments below, I've added an edit.
Sample DAO (ignore the validate method; it just does some null and quantity checks):
This is a sample findAll method inside a delegate called from the DAO that the RequestFactory ServiceLocator is using:
public List<EquipmentCheckin> findAll(Subject subject, Objectify ofy, Event event) {
    final Business business = (Business) subject.getSession().getAttribute(BUSINESS_ATTRIBUTE);
    final List<EquipmentCheckin> checkins = ofy.query(EquipmentCheckin.class)
            .filter(BUSINESS_ATTRIBUTE, business)
            .filter(EVENT_CONDITION, event)
            .list();
    return validate(ofy, checkins);
}
Now, when this is executed, I find that the following method is actually being called in my AbstractDAO:
/**
 * Looks up a single entity by its numeric id.
 *
 * @param id the datastore id of the entity
 * @return the matching entity, or null if none exists
 */
public T find(Long id) {
    System.out.println("finding " + clazz.getSimpleName() + " id = " + id);
    return ObjectifyService.begin().find(clazz, id);
}
Yes, all queries bypass Objectify's integrated memcache and fetch results directly from the datastore. The datastore provides the (increasingly sophisticated) query engine that understands how to return results; determining cache invalidation for query results is pretty much impossible from the client side.
On the other hand, Objectify 4 does offer a hybrid query cache whereby queries are automagically converted to a keys-only query followed by a batch get. The keys-only query still requires the datastore, but any entity instances are pulled from memcache (and populate it on a miss). It might save you money.
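If you're on the Objectify 3-style API shown in the question, you can emulate that hybrid behavior by hand: run a keys-only query, then batch-get the entities by key so cacheable entities are served from memcache. Below is a minimal sketch, not from the original answer; it reuses the ChildEntity/rootEntity names from the question, and the wrapper class and method are hypothetical:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.googlecode.objectify.Key;
import com.googlecode.objectify.Objectify;

public class ChildEntityQueries {

    /** Keys-only query followed by a batch get, emulating the hybrid query cache by hand. */
    public List<ChildEntity> findByRoot(Objectify ofy, Object rootEntity) {
        // The keys-only query still goes to the datastore, but it returns only keys.
        List<Key<ChildEntity>> keys = ofy.query(ChildEntity.class)
                .filter("rootEntity", rootEntity)
                .listKeys();

        // The batch get goes through Objectify's global cache, so entities marked as
        // cacheable (@Cached in Objectify 3, @Cache in Objectify 4) are read from
        // memcache and only fall back to the datastore on a miss.
        Map<Key<ChildEntity>, ChildEntity> loaded = ofy.get(keys);
        return new ArrayList<ChildEntity>(loaded.values());
    }
}

The same two-step pattern works for the ancestor query as well; only the filter changes.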
Related
I'm trying to query a dataset in Cloud Firestore that has 180k documents, but the process is extremely long (70 seconds). To avoid this, should I split my collection into subcollections, or is there any way to make it more efficient?
QUERY FUNCTION
Future getProfList(String uni, String department, bool asynCall) async {
  List<Academician> academicianList = [];
  await FirebaseFirestore.instance
      .collection('academicians')
      .where('university', isEqualTo: academicianFilter(uni))
      .where(stringCorrector('field'), isEqualTo: academicianFilter(department))
      .get()
      .then((value) => value.docs.forEach((element) {
            academicianList.add(Academician.fromJson(element));
          }));
  asynCall = false;
  return academicianList;
}
Official doc:
According to https://firebase.googleblog.com/2017/10/introducing-cloud-firestore.html, my dataset shouldn't be a problem here, and my result set is mostly 50-100 documents:
Uses collections and documents to structure and query data. This data model is familiar and intuitive for many developers. It also allows for expressive queries. Queries scale with the size of your result set, not the size of your data set, so you'll get the same performance fetching 1 result from a set of 100, or 100,000,000.
Firestore actually has a guarantee that the time it takes to execute a query depends on the amount of data that query returns, and not in any way on the amount of data that exists in the collection.
Unfortunately (as confirmed in the comments to your question) you're hitting an edge case here. This guarantee applies to queries run on the server, which is the most common use-case.
But since you added the data from the same device, you have a local database/cache on that device that also contains all these documents. And the performance guarantee does not apply for queries against the local cache.
So the easiest way to get the expected performance is to clear the local cache, for example by uninstalling/reinstalling the app. Then you'll be in the more common scenario, where your query is sent to the server and takes time that is (only) proportional to the number of documents you retrieve.
I am using GAE for my server, where I have all my entities in Datastore. One of the entity kinds has more than 2000 records, and it is taking almost 30 seconds to read the whole kind. So I wanted to use a cache to improve performance.
I have tried the Datastore Objectify @Cache annotation, but I can't find how to read from the stored cache. I have declared the entity as below:
@Entity
@Cache
public class Devices {
}
The second thing I tried is Memcache. I am storing the whole List<Devices> under a key, but it is not being stored: I can't see it in the Memcache console, yet at the same time no errors or exceptions are shown while storing the objects.
putValue("temp", List<Devices>)
public void putValue(String key, Object value) {
    Cache cache = getCache();
    logger.info(TAG + "getCache() :: storing memcache for key : " + key);
    try {
        if (cache != null) {
            cache.put(key, value);
        }
    } catch (Exception e) {
        logger.info(TAG + "getCache() :: exception : " + e);
    }
}
When I try to retrieve it using getValue("temp"), it returns null or empty:
Object object = cache.get(key);
My main objective is to limit the time to get all the records of the entity to 5 seconds.
Can anyone suggest what I am doing wrong here, or a better solution to retrieve the records quickly from Datastore?
Datastore Objectify actually uses the App Engine Memcache service to cache your entity data globally when you use the #Cache annotation. However, as explained in the doc here, only get-by-key, save(), and delete() interact with the cache. Query operations are not cached.
Regarding the App Engine Memcache method, you may be hitting the limit for the maximum size of a cached data value, which is 1 MiB, although I believe that would raise an exception.
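If you want those failures to surface instead of passing silently, one option (a sketch, not part of the original answer) is to use the low-level MemcacheService with a strict error handler, so that errors on put are rethrown rather than swallowed; the value must be Serializable and stay under the per-entry size limit. The DeviceCache class and its method names below are hypothetical:

import java.util.List;

import com.google.appengine.api.memcache.ErrorHandlers;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class DeviceCache {

    /** Stores the device list; with the strict handler, failed puts throw instead of silently doing nothing. */
    public void putDevices(String key, List<Devices> devices) {
        MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();
        memcache.setErrorHandler(ErrorHandlers.getStrict());
        // Devices (and everything it references) must implement Serializable, and the
        // serialized list must stay under the memcache value size limit (about 1 MiB).
        memcache.put(key, devices);
    }

    @SuppressWarnings("unchecked")
    public List<Devices> getDevices(String key) {
        MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();
        return (List<Devices>) memcache.get(key);
    }
}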
Regarding the query itself, you may be better off using a keys-only query and then doing a get on each returned key (or a single batch get). That way, Memcache will be used for each record.
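As a sketch of that suggestion (assuming Objectify 4+, which is where the @Cache annotation comes from; the DeviceRepository class is hypothetical), the query can be reduced to keys and the entities then loaded by key so that cached Devices records come from memcache:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.googlecode.objectify.Key;
import com.googlecode.objectify.ObjectifyService;

public class DeviceRepository {

    /** Keys-only query followed by a batch get; cached Devices entities are read from memcache. */
    public List<Devices> findAllDevices() {
        // Keys-only query: still a datastore round trip, but it transfers no entity data.
        List<Key<Devices>> keys = ObjectifyService.ofy()
                .load()
                .type(Devices.class)
                .keys()
                .list();

        // Load by key: because Devices is annotated with @Cache, entities already in
        // memcache are returned from there, and only misses hit the datastore.
        Map<Key<Devices>, Devices> devices = ObjectifyService.ofy().load().keys(keys);
        return new ArrayList<Devices>(devices.values());
    }
}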
I currently have an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
    wDataCsv += '\\n' + ','.join(d)
# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
    # model for data to save
    ...

    # Function for querying data below a given ancestor between two optional times
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        return cls.query(cls.time >= start, cls.time <= end, ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
cursor = None
while True:
    gc.collect()
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
            wDataCsv += '\\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using projection queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely, querying all properties of the entity excluding the JSON string. While this is not ideal, as it still pulls gratuitous information from the database each time, generating an individual projection for each specific query is not scalable due to the exponential growth of the necessary indices. For this application, since each additional property adds negligible memory (aside from that JSON string), it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time), and use cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally, you could try to create and store just portions of the graphs and stitch them together at the end (only if it comes down to that; I'm not sure exactly how it would be done, and I imagine it wouldn't be trivial).
When you use NHibernate to "fetch" a mapped object, it outputs a SELECT query to the database. It outputs this using parameters; so if I query a list of cars based on tenant ID and name, I get:
select Name, Location from Car where tenantID=@p0 and Name=@p1
This has the nice benefit of our database creating (and caching) a query plan based on this query and the result, so when it is run again, the query is much faster as it can load the plan from the cache.
The problem with this is that we are a multi-tenant database, and almost all of our indexes are partition aligned. Our tenants have vastly different data sets; one tenant could have 5 cars, while another could have 50,000. And so because NHibernate does this, it has the net effect of our database creating and caching a plan for the FIRST tenant that runs it. This plan is likely not efficient for subsequent tenants who run the query.
What I WANT to do is force NHibernate NOT to parameterize certain parameters; namely, the tenant ID. So I'd want the query to read:
select Name, Location from Car where tenantID=55 and Name=@p0
I can't figure out how to do this in the HBM.XML mapping. How can I dictate to NHibernate how to use parameters? Or can I just turn parameters off altogether?
OK everyone, I figured it out.
The way I did it was overriding the SqlClientDriver with my own custom driver that looks like this:
using System.Data;
using System.Text.RegularExpressions;

using NHibernate.Driver;

public class CustomSqlClientDriver : SqlClientDriver
{
    private static Regex _tenantIDReplacer = new Regex(@"\.TenantID=(@p0)", RegexOptions.Compiled);

    public override void AdjustCommand(IDbCommand command)
    {
        var m = _tenantIDReplacer.Match(command.CommandText);
        if (!m.Success)
            return;

        // grab the name of the first parameter (the tenant ID / partition key)
        var parameterName = m.Groups[1].Value;

        // find the parameter value
        var tenantID = (IDbDataParameter) command.Parameters[parameterName];
        var valueOfTenantID = tenantID.Value;

        // now replace the parameter placeholder with the literal tenant ID
        command.CommandText = _tenantIDReplacer.Replace(command.CommandText, ".TenantID=" + valueOfTenantID);
    }
}
I override the AdjustCommand method and use a Regex to replace the tenantID. This works; not sure if there's a better way, but I really didn't want to have to open up NHibernate and start messing with core code.
You'll have to register this custom driver in the connection.driver_class property of the SessionFactory upon initialization.
Hope this helps somebody!
NB: I am using db (not ndb) here. I know ndb has a count_async() but I am hoping for a solution that does not involve migrating over to ndb.
Occasionally I need an accurate count of the number of entities that match a query. With db this is simply:
q = some Query with filters
num_entities = q.count(limit=None)
It costs a small db operation per entity but it gets me the info I need. The problem is that I often need to do a few of these in the same request and it would be nice to do them asynchronously but I don't see support for that in the db library.
I was thinking I could use run(keys_only=True, batch_size=1000), as it runs the query asynchronously and returns an iterator. I could first call run() on each query and then later count the results from each iterator. It costs the same as count(); however, run() has proven to be slower in testing (perhaps because it actually returns results), and in fact it seems that batch_size is capped at 300 regardless of how high I set it, which requires more RPCs to count thousands of entities than the count() method does.
My test code for run() looks like this:
queries = list of Queries with filters
iters = []
for q in queries:
    iters.append(q.run(keys_only=True, batch_size=1000))
for iter in iters:
    count_entities_from(iter)
No, there's no equivalent in db. The whole point of ndb is that it adds these sorts of capabilities, which were missing in db.