Issue related to Google App Engine query within a date range - google-app-engine

I am concerned about querying entities this way
created_start = datetime.today()
created_start = created_start - timedelta(hours=1)
created_end = datetime.now()
a = Message.all()
a.filter('created >=',created_start)
a.filter('created <',created_end)
Due to the 1000 query results restriction. So two questions:
Will this work if .all() returns more that 1000 results? Or to put it in a different way. Will all() return more than a 1000 results incase there were more?
Is there a better way to achieve querying for entities between a given data range?
Thank you very much in advance

Your solution is good, since Version 1.3.6, query results are no longer capped at 1000.
You can iterate a entities until exhaustion or fetch chunks of entities using a cursor.

Related

CloudFirestore how to query large dataset?

I'm trying to query a dataset in cloud firestore which has 180k documents, but process is extremely long(70 seconds), in order to avoid this, should i split my collection to subcollections or is there anyway to make it more efficient ?
QUERY FUNCTION
Future getProfList(String uni, String department, bool asynCall) async {
List<Academician> academicianList = [];
await FirebaseFirestore.instance
.collection('academicians')
.where('university', isEqualTo: academicianFilter(uni))
.where(stringCorrector('field'), isEqualTo: academicianFilter(department))
.get()
.then((value) => value.docs.forEach((element) {
academicianList.add(Academician.fromJson(element));
}));
asynCall = false;
return academicianList;
}
official doc
According to https://firebase.googleblog.com/2017/10/introducing-cloud-firestore.html
my dataset shouldn't be problem here, and my result set is mostly 50-100 document.
Uses collections and documents to structure and query data. This data
model is familiar and intuitive for many developers. It also allows
for expressive queries. Queries scale with the size of your result
set, not the size of your data set, so you'll get the same performance
fetching 1 result from a set of 100, or 100,000,000.
Firestore actually has a guarantee that the time it takes to execute a query depends on the amount of data that query returns, and not in any way on the amount of data that exists in the collection.
Unfortunately (as confirmed in the comments to your question) you're hitting an edge case here. This guarantee applies to queries run on the server, which is the most common use-case.
But since you added the data from the same device, you have a local database/cache on that device that also contains all these documents. And the performance guarantee does not apply for queries against the local cache.
So the easiest to get the expected performance is to clear the local cache, for example by uninstalling/reinstalling the app. Then you'll be in the more common scenario, where your query is sent to the server and takes time that is (only) proportional to the number of documents you retrieve.

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have a an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graph of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
for acct in qry.fetch():
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
# model for data to save
...
# Function for querying data below a given ancestor between two optional
# times
#classmethod
def time_ordered_query(cls, ancestor_key, start=None, end=None):
return cls.query(cls.time>=start, cls.time<=end,ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
cursor = None
while True:
gc.collect()
fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
if fetched:
for acct in fetched:
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
if more and next_cursor:
cursor = next_cursor
else:
break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using Projection Queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely querying all properties of the entity excluding the JSON string. While this is not ideal as it still pulls gratuitous information from the database each time, generating individual queries of each specific query is not scalable due to the exponential growth of necessary indices. For this application, as each additional property is negligible additional memory (aside form that json string), it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally you could try to create and store just portions of the graphs and just stitch them together in the end (only if it comes down to that, I'm not sure how exactly it would be done, I imagine it wouldn't be trivial).

Is there a count_async() equivalent for db models?

NB: I am using db (not ndb) here. I know ndb has a count_async() but I am hoping for a solution that does not involve migrating over to ndb.
Occasionally I need an accurate count of the number of entities that match a query. With db this is simply:
q = some Query with filters
num_entities = q.count(limit=None)
It costs a small db operation per entity but it gets me the info I need. The problem is that I often need to do a few of these in the same request and it would be nice to do them asynchronously but I don't see support for that in the db library.
I was thinking I could use run(keys_only=True, batch_size=1000) as it runs the query asynchronously and returns an iterator. I could first call run() on each query and then later count the results from each iterator. It costs the same as count() however run() has proven to be slower in testing (perhaps because it actually returns results) and in fact it seems that batch_size is limited at 300 regardless of how high I set it which requires more RPCs to do a count of thousands of entities than the count() method does.
My test code for run() looks like this:
queries = list of Queries with filters
iters = []
for q in queries:
iters.append( q.run(keys_only=True, batch_size=1000) )
for iter in iters:
count_entities_from(iter)
No, there's no equivalent in db. The whole point of ndb is that it adds these sort of capabilities which were missing in db.

How to get all results from solr query?

I executed some query like "Address:Jack*". It show numFound = 5214 and display 100 documents in results page(I changed default display results from 10 to 100).
How can I get all documents.
I remember myself doing &rows=2147483647
2,147,483,647 is integer's maximum value. I recall using a number bigger than that once and having a NumberFormatException because it couldn't be parsed into an int. I don't know if they use Long nowadays, but 2 billion rows is normally more than enough.
Small note:
Be careful if you are planning to do this in production. If you do a query like * : * and your index is big, you could transferring a couple of gigabytes in that query.
If you know you won't have many docs, go ahead and use integer's max value.
On the other hand, if you are doing a one-time script and just need to dump all results (for example document ID's) then this approach is valid, if you don't mind waiting 3-5 minutes for a query to return.
Don't use &rows=2147483647
Don't use Integer.MAX_VALUE(2147483647) as value of rows in production. This will heavily slow down your query even if you have a small resultset, because solr preallocates a queue in this size. see https://issues.apache.org/jira/browse/SOLR-7580
I strongly suggest to use Exporting Result Sets
It’s possible to export fully sorted result sets using a special rank query parser and response writer specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
Or I suggest to use Deep Paging.
Simple Pagination is a easy thing when you have few documents to read and all you have to do is play with start and rows parameters. But this is not a feasible way when you have many documents, I mean hundreds of thousands or even millions.
This is the kind of thing that could bring your Solr server to their knees.
For typical applications displaying search results to a human user,
this tends to not be much of an issue since most users don’t care
about drilling down past the first handful of pages of search results
— but for automated systems that want to crunch data about all of the
documents matching a query, it can be seriously prohibitive.
This means that if you have a website and are paging search results, a real user do not go so further but consider on the other hand what can happen if a spider or a scraper try to read all the website pages.
Now we are talking of Deep Paging.
I’ll suggest to read this amazing post:
https://lucidworks.com/post/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
And take a look at this document page:
https://solr.apache.org/guide/pagination-of-results.html
And here is an example that try to explain how to paginate using the cursors.
SolrQuery solrQuery = new SolrQuery();
solrQuery.setRows(500);
solrQuery.setQuery("*:*");
solrQuery.addSort("id", ORDER.asc); // Pay attention to this line
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
solrQuery.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrClient.query(solrQuery);
String nextCursorMark = rsp.getNextCursorMark();
for (SolrDocument d : rsp.getResults()) {
...
}
if (cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
Returning all the results is never a good option as It would be very slow in performance.
Can you mention your use case ?
Also, Solr rows parameter helps you to tune the number of the results to be returned.
However, I don't think there is a way to tune rows to return all results. It doesn't take a -1 as value.
So you would need to set a high value for all the results to be returned.
What you should do is to first create a SolrQuery shown below and set the number of documents you want to fetch in a batch.
int lastResult=0; //this is for processing the future batch
String query = "id:[ lastResult TO *]"; // just considering id for the sake of simplicity
SolrQuery solrQuery = new SolrQuery(query).setRows(500); //setRows will set the required batch, you can change this to whatever size you want.
SolrDocumentList results = solrClient.query(solrQuery).getResults(); //execute this statement
Here I am considering an example of search by id, you can replace it with any of your parameter to search upon.
The "lastResult" is the variable you can change after execution of the first 500 records(500 is the batch size) and set it to the last id got from the results.
This will help you execute the next batch starting with last result from previous batch.
Hope this helps. Shoot up a comment below if you need any clarification.
For selecting all documents in dismax/edismax via Solarium php client, the normal query syntax : does not work. To select all documents set the default query value in solarium query to empty string. This is required as the default query in Solarium is :. Also set the alternative query to :. Dismax/eDismax normal query syntax does not support :, but the alternative query syntax does.
For more details following book can be referred
http://www.packtpub.com/apache-solr-php-integration/book
As the other answers pointed out, you can configure the rows to be max integer to yield back all the results for a query.
I would recommend though to use Solr feature of pagination, and build a function that will return for you all the results using the cursorMark API. The gist of it is you set the cursorMark parameter to '*', you set the page size(rows parameter), and on each result you'll get a cursorMark for the next page, so you execute the same query only with the cursorMark given from the last result. This way you'll have more flexibility on how much of the results you want back, in a much more performant way.
The way I dealt with the problem is by running the query twice:
// Start with your (usually small) default page size
solrQuery.setRows(50);
QueryResponse response = solrResponse(query);
if (response.getResults().getNumFound() > 50) {
solrQuery.setRows(response.getResults().getNumFound());
response = solrResponse(query);
}
It makes a call twice to Solr, but gets you all matching records....with the small performance penalty.
query.setRows(Integer.MAX_VALUE);
works for me!!

What's your experience developing on Google App Engine?

Is GQL easy to learn for someone who knows SQL? How is Django/Python? Does App Engine really make scaling easy? Is there any built-in protection against "GQL Injections"? And so on...
I'd love to hear the not-so-obvious ups and downs of using app engine.
Cheers!
My experience with google app engine has been great, and the 1000 result limit has been removed, here is a link to the release notes:
app-engine release notes
No more 1000 result limit - That's
right: with addition of Cursors and
the culmination of many smaller
Datastore stability and performance
improvements over the last few months,
we're now confident enough to remove
the maximum result limit altogether.
Whether you're doing a fetch,
iterating, or using a Cursor, there's
no limits on the number of results.
The most glaring and frustrating issue is the datastore api, which looks great and is very well thought out and easy to work with if you are used to SQL, but has a 1000 row limit across all query resultsets, and you can't access counts or offsets beyond that. I've run into weirder issues, with not actually being able to add or access data for a model once it goes beyond 1000 rows.
See the Stack Overflow discussion about the 1000 row limit
Aral Balkan wrote a really good summary of this and other problems
Having said that, app engine is a really great tool to have at ones disposal, and I really enjoy working with it. It's perfect for deploying micro web services (eg: json api's) to use in other apps.
GQL is extremely simple - it's a subset of the SQL 'SELECT' statement, nothing more. It's only a convenience layer over the top of the lower-level APIs, though, and all the parsing is done in Python.
Instead, I recommend using the Query API, which is procedural, requires no run-time parsing, and makes 'GQL injection' vulnerabilities totally impossible (though they are impossible in properly written GQL anyway). The Query API is very simple: Call .all() on a Model class, or call db.Query(modelname). The Query object has .filter(field_and_operator, value), .order(field_and_direction) and .ancestor(entity) methods, in addition to all the facilities GQL objects have (.get(), .fetch(), .count()), etc.) Each of the Query methods returns the Query object itself for convenience, so you can chain them:
results = MyModel.all().filter("foo =", 5).order("-bar").fetch(10)
Is equivalent to:
results = MyModel.gql("WHERE foo = 5 ORDER BY bar DESC LIMIT 10").fetch()
A major downside when working with AppEngine was the 1k query limit, which has been mentioned in the comments already. What I haven't seen mentioned though is the fact that there is a built-in sortable order, with which you can work around this issue.
From the appengine cookbook:
def deepFetch(queryGen,key=None,batchSize = 100):
"""Iterator that yields an entity in batches.
Args:
queryGen: should return a Query object
key: used to .filter() for __key__
batchSize: how many entities to retrieve in one datastore call
Retrieved from http://tinyurl.com/d887ll (AppEngine cookbook).
"""
from google.appengine.ext import db
# AppEngine will not fetch more than 1000 results
batchSize = min(batchSize,1000)
query = None
done = False
count = 0
if key:
key = db.Key(key)
while not done:
print count
query = queryGen()
if key:
query.filter("__key__ > ",key)
results = query.fetch(batchSize)
for result in results:
count += 1
yield result
if batchSize > len(results):
done = True
else:
key = results[-1].key()
The above code together with Remote API (see this article) allows you to retrieve as many entities as you need.
You can use the above code like this:
def allMyModel():
q = MyModel.all()
myModels = deepFetch(allMyModel)
At first I had the same experience as others who transitioned from SQL to GQL -- kind of weird to not be able to do JOINs, count more than 1000 rows, etc. Now that I've worked with it for a few months I absolutely love the app engine. I'm porting all of my old projects onto it.
I use it to host several high-traffic web applications (at peak time one of them gets 50k hits a minute.)
Google App Engine doesn't use an actual database, and apparently uses some sort of distributed hash map. This will lend itself to some different behaviors that people who are accustomed to SQL just aren't going to see at first. So for example getting a COUNT of items in regular SQL is expected to be a fast operation, but with GQL it's just not going to work the same way.
Here are some more issues:
http://blog.burnayev.com/2008/04/gql-limitations.html
In my personal experience, it's an adjustment, but the learning curve is fine.

Resources