NDB cursors not remembering some query data? - google-app-engine

So, I have this query:
results, cursor, more = MyModel.query(
ancestor=mykey,
).order(-MyModel.time).fetch_page(20)
So far so good, data returned is fine etc. Now, let's fetch some more, shall we? Seems logical to do just this:
results, cursor, more = MyModel.query() \
.order(-MyModel.time) \
.fetch_page(20, start_cursor=Cursor(urlsafe=request.cursor))
And... weird things happen. Definetely too many results, unordered results... What's going on?
So I change it to:
results, cursor, more = MyModel.query(ancestor=mykey) \
.order(-MyModel.time) \
.fetch_page(20, start_cursor=Cursor(urlsafe=request.cursor))
Suddenly, wat less results... let's add
.order(-MyModel.time)
And I get what I expected.
Now... Am I missing something here? Shouldn't passing cursor already take care of ordering and ancestor? There is ordering example for fetching the initial page in the documentation - https://cloud.google.com/appengine/docs/python/ndb/queries#cursors - but nowhere it is said, that subsequent pages also require ordering to be set. I would just like to know, if that is really working as intended, or it's a bug?
If it's really working as intended, is there anywhere I can read about what information exactly is stored in cursor? Would be really helpful to avoid bugs like this in future.

From Query Cursors (highlight from me):
A query cursor is a small opaque data structure representing a
resumption point in a query. This is useful for showing a user a
page of results at a time; it's also useful for handling long jobs
that might need to stop and resume. A typical way to use them is with
a query's fetch_page() method. It works somewhat like fetch(), but it
returns a triple (results, cursor, more). The returned more flag
indicates that there are probably more results; a UI can use this, for
example, to suppress a "Next Page" button or link. To request
subsequent pages, pass the cursor returned by one fetch_page() call
into the next.
A cursor exists (and makes sense) only in the context of the original query from which it was produced, you can't use the cursor produced in the context of one query (the ancestor query in your case) to navigate results from another query (your non-ancestor query). I mean it might not barf (as your experiment proves) but the results are likely not what you expect :)
Fundamentally the cursor simply represents the current position (index if you want) in the list of the query's result. Using that index in some other list might not crash, but won't make a lot of sense either (unless specifically designed to).
Probably a good habit to use a variable to store the query for re-use instead of re-building it every time, to avoid such accidental mistakes. As illustrated in the snippets.py example on that doc:
# Set up.
q = Bar.query()
q_forward = q.order(Bar.key)
q_reverse = q.order(-Bar.key)
# Fetch a page going forward.
bars, cursor, more = q_forward.fetch_page(10)
# Fetch the same page going backward.
r_bars, r_cursor, r_more = q_reverse.fetch_page(10, start_cursor=cursor)
Side note: this example actually uses the cursor from one query to navigate results in another query, but the 2 queries are designed to be "compatible".

Related

increase performance of a linq query using contains

I have a winforms app where I have a Telerik dropdownchecklist that lets the user select a group of state names.
Using EF and the database is stored in Azure SQL.
The code then hits a database of about 17,000 records and filters the results to only include states that are checked.
Works fine. I am wanting to update a count on the screen whenever they change the list box.
This is the code, in the itemCheckChanged event:
var states = stateDropDownList.CheckedItems.Select(i => i.Value.ToString()).ToList();
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop).ToArray();
ExportInfo_tb.Text = "Current Stop Count: " + filteredStops.Count();
It works, but it is slow.
I tried to load everything into a memory variable then querying that vs the database but can't seem to figure out how to do that.
Any suggestions?
Improvement:
I picked up a noticeable improvement by limiting the amount of data coming down by:
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop.Stop_state).ToList();
And better yet --
int count = (from stop in aDb.Stop_address_details where
states.Contains(stop.Stop_state)
select stop).Count();
ExportInfo_tb.Text = "Current Stop Count: " + count.ToString();
The performance of you query, actually, has nothing to do with Contiains, in this case. Contains is pretty performant. The problem, as you picked up on in your third solution, is that you are pulling far more data over the network than required.
In your first solution you are pulling back all of the rows from the server with the matching stop state and performing the count locally. This is the worst possible approach. You are pulling back data just to count it and you are pulling back far more data than you need.
In your second solution you limited the data coming back to a single field which is why the performance improved. This could have resulted in a significant improvement if your table is really wide. The problem with this is that you are still pulling back all the data just to count it locally.
In your third solution EF will translate the .Count() method into a query that performs the count for you. So the count will happen on the server and the only data returned is a single value; the result of count. Since network latency CAN often be (but is not always) the longest step when performing a query, returning less data can often result in significant gains in query speed.
The query translation of your final solution should look something like this:
SELECT COUNT(*) AS [value]
FROM [Stop_address_details] AS [t0]
WHERE [t0].[Stop_state] IN (#p0)

Can I use Python App Engine Memcache's cas (compare and set) when a value isn't yet cached?

I want to cache the results of an ndb query in memcache. When I do something that would cause the results to change, I delete the key to invalidate it. When I need it, I check if it has been cached and, if not, I do the query and cache it.
However, there is a race condition. Sample code is below:
def Invalidate():
memcache.delete(KEY)
def GetFromCacheOrQuery():
client = memcache.Client()
if not client.get(KEY, for_cas=True):
result = DoQuery() # This is slow, so it should be cached
client.cas(KEY, result)
The race condition if I were to use client.set rather than client.cas:
Someone runs Invalidate.
GetFromCacheOrQuery stores DoQuery() in result
Someone runs Invalidate. Now, the results from DoQuery are INVALID and SHOULD NOT be cached.
GetFromCacheOrQuery resumes, caching the result that is invalid.
I believe that cas solves the race condition, but cas never actually sets the value if the previous get returned None (which is the case if the value is deleted).
Is there a way to get around this? Should I just store a custom value for invalidation rather than deleting the key, or is there a cleaner way?
Thanks!
Sounds like you're describing the gumball solution described in excruciating detail here: http://abineshtd.blogspot.com/2012/09/gumball-race-condition-prevention.html
Yes, a good way to solve this problem is to store a value with a short timeout that means "this value is invalidated" and then don't overwrite it -- let it expire. This means caching won't work for a little bit, but that's better than getting bad data.

App engine datastore inconsistent?

This is so weird...
First of all this query works in the datastore viewer, ie. it returns the correct row.
SELECT * FROM Level where short_id = 'Ec71eN'
But if I run this
Level.all().filter("short_id = ", 'Ec71eN').get()
it returns None, if I run this:
db.GqlQuery("SELECT * FROM Level where short_id = '%s'" % 'Ec71eN').get()
it also returns None. If I run this:
level = Level.get_by_id(189009)
it returns the correct row (189009 is the id for the correct row)
Puzzling? What can be wrong here? I have never seen anything like this before, it has worked correctly for at least a couple of weeks in production... I think I have at least two cases now where it dosent work starting today.
UPDATE: This can not be a eventually consistent problem since the row was 7 hours old when I tried the above. I had two rows with same symptoms, strangely booth generated by the same users. They where booth "fixed" after I did a manual fecth of their ids by uploading special case code like:
if short_id==CASE_1_SHORT_ID:
level = Level.get_by_id(CASE_1_ID)
After that the query worked as usual.
Are you using the HRD? Nothing's wrong. You know it's supposed to be eventually consistent right?
Query operations are eventually consistent.
Get-by-id operations are fully consistent.
What you describe is correct datastore behavior. It's a bit odd that the datastore viewer operation returns the correct result, but it might have hit a separate tablet on the datastore operation.
Given that it was created 7 hours ago, the 'eventual consistency' generally should take seconds to minutes.
If eventual consistency IS the problem, run the same query method a bunch of times and see if returns the same result. If it continuously returns the same result with the same method, then it is more than likely not an eventual consistency problem. You should switch to the NDB API for querying data as well - it's 1000 times better and Guido worked on it - so you know it's good. Does NDB show the same inconsistency?

How to get all results from solr query?

I executed some query like "Address:Jack*". It show numFound = 5214 and display 100 documents in results page(I changed default display results from 10 to 100).
How can I get all documents.
I remember myself doing &rows=2147483647
2,147,483,647 is integer's maximum value. I recall using a number bigger than that once and having a NumberFormatException because it couldn't be parsed into an int. I don't know if they use Long nowadays, but 2 billion rows is normally more than enough.
Small note:
Be careful if you are planning to do this in production. If you do a query like * : * and your index is big, you could transferring a couple of gigabytes in that query.
If you know you won't have many docs, go ahead and use integer's max value.
On the other hand, if you are doing a one-time script and just need to dump all results (for example document ID's) then this approach is valid, if you don't mind waiting 3-5 minutes for a query to return.
Don't use &rows=2147483647
Don't use Integer.MAX_VALUE(2147483647) as value of rows in production. This will heavily slow down your query even if you have a small resultset, because solr preallocates a queue in this size. see https://issues.apache.org/jira/browse/SOLR-7580
I strongly suggest to use Exporting Result Sets
It’s possible to export fully sorted result sets using a special rank query parser and response writer specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
Or I suggest to use Deep Paging.
Simple Pagination is a easy thing when you have few documents to read and all you have to do is play with start and rows parameters. But this is not a feasible way when you have many documents, I mean hundreds of thousands or even millions.
This is the kind of thing that could bring your Solr server to their knees.
For typical applications displaying search results to a human user,
this tends to not be much of an issue since most users don’t care
about drilling down past the first handful of pages of search results
— but for automated systems that want to crunch data about all of the
documents matching a query, it can be seriously prohibitive.
This means that if you have a website and are paging search results, a real user do not go so further but consider on the other hand what can happen if a spider or a scraper try to read all the website pages.
Now we are talking of Deep Paging.
I’ll suggest to read this amazing post:
https://lucidworks.com/post/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
And take a look at this document page:
https://solr.apache.org/guide/pagination-of-results.html
And here is an example that try to explain how to paginate using the cursors.
SolrQuery solrQuery = new SolrQuery();
solrQuery.setRows(500);
solrQuery.setQuery("*:*");
solrQuery.addSort("id", ORDER.asc); // Pay attention to this line
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
solrQuery.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrClient.query(solrQuery);
String nextCursorMark = rsp.getNextCursorMark();
for (SolrDocument d : rsp.getResults()) {
...
}
if (cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
Returning all the results is never a good option as It would be very slow in performance.
Can you mention your use case ?
Also, Solr rows parameter helps you to tune the number of the results to be returned.
However, I don't think there is a way to tune rows to return all results. It doesn't take a -1 as value.
So you would need to set a high value for all the results to be returned.
What you should do is to first create a SolrQuery shown below and set the number of documents you want to fetch in a batch.
int lastResult=0; //this is for processing the future batch
String query = "id:[ lastResult TO *]"; // just considering id for the sake of simplicity
SolrQuery solrQuery = new SolrQuery(query).setRows(500); //setRows will set the required batch, you can change this to whatever size you want.
SolrDocumentList results = solrClient.query(solrQuery).getResults(); //execute this statement
Here I am considering an example of search by id, you can replace it with any of your parameter to search upon.
The "lastResult" is the variable you can change after execution of the first 500 records(500 is the batch size) and set it to the last id got from the results.
This will help you execute the next batch starting with last result from previous batch.
Hope this helps. Shoot up a comment below if you need any clarification.
For selecting all documents in dismax/edismax via Solarium php client, the normal query syntax : does not work. To select all documents set the default query value in solarium query to empty string. This is required as the default query in Solarium is :. Also set the alternative query to :. Dismax/eDismax normal query syntax does not support :, but the alternative query syntax does.
For more details following book can be referred
http://www.packtpub.com/apache-solr-php-integration/book
As the other answers pointed out, you can configure the rows to be max integer to yield back all the results for a query.
I would recommend though to use Solr feature of pagination, and build a function that will return for you all the results using the cursorMark API. The gist of it is you set the cursorMark parameter to '*', you set the page size(rows parameter), and on each result you'll get a cursorMark for the next page, so you execute the same query only with the cursorMark given from the last result. This way you'll have more flexibility on how much of the results you want back, in a much more performant way.
The way I dealt with the problem is by running the query twice:
// Start with your (usually small) default page size
solrQuery.setRows(50);
QueryResponse response = solrResponse(query);
if (response.getResults().getNumFound() > 50) {
solrQuery.setRows(response.getResults().getNumFound());
response = solrResponse(query);
}
It makes a call twice to Solr, but gets you all matching records....with the small performance penalty.
query.setRows(Integer.MAX_VALUE);
works for me!!

Google App Engine - Delete until count() <= 0

What is the difference between these 2 pieces of code?
query=Location.all(keys_only=True)
while query.count()>0:
db.delete(query.fetch(5))
# --
while True:
query=Location.all(keys_only=True)
if not query.count():
break
db.delete(query.fetch(5))
They both work.
Logically, these two pieces of code perform the same exact thing - they delete every Location entity, 5 at a time.
The first piece of code is better both in terms of style and (slightly) in terms of performance. (The query itself does not need to be rebuilt in each loop).
However, this code is not as efficient as it could be. It has several problems:
You use count() but do not need to. It would be more efficient to simply fetch the entities, and then test the results to see if you got any.
You are making more round-trips to the datastore than you need to. Each count(), fetch(), and delete() call must go the datastore and back. These round-trips are slow, so you should try to minimize them. You can do this by fetching more entities in each loop.
Example:
q = Location.all(keys_only=True)
results = q.fetch(500)
while results:
db.delete(results)
results = q.fetch(500)
Edit: Have a look at Nick's answer below - he explains why this code's performance can be improved even more by using query cursors.
Here's a solution that's neater, but you may or may not consider to be a hack:
q = Location.all(keys_only=True)
for batch in iter(lambda: q.fetch(500), []):
db.delete(batch)
One gotcha, however, is that as you delete more and more, the backend is forced to skip over the 'tombstoned' entities to find the next ones that aren't deleted. Here's a more efficient solution that uses cursors:
q = Location.all(keys_only=True)
results = q.fetch(500)
while results:
db.delete(results)
q = Location.all(keys_only=True).with_cursor(q.cursor())
results = q.fetch(500)
In the second one, query will be assigned/updated in every loop. I don't know if this is needed with the logic behind it (I don't use google app engine). To replicate this behaviour, the first one would have to look like this:
query=Location.all(keys_only=True)
while query.count()>0:
db.delete(query.fetch(5))
query=Location.all(keys_only=True)
In my oppinion, the first style is way more readable than the second one.
It's too bad you can't do this in python.
query=Location.all(keys_only=True)
while locations=query.fetch(5):
db.delete(locations)
Like in the other P language
while(#row=$sth->fetchrow_array){
do_something();
}

Resources