I'm using both ndb and search-api queries in my python appengine project.
The only official docs on cursors I can find:
https://cloud.google.com/appengine/docs/python/datastore/query-cursors
https://cloud.google.com/appengine/docs/python/search/cursorclass
The following things are unclear to me:
1. What is a cursor's time-to-live? Can I still use year-old cursors?
2. How does cursor pagination behave when items are added to or removed from the original collection? (And if a cursor points to a particular record, what happens if that record no longer exists?)
3. How does query ordering affect the above?
4. Are there any fundamental differences between ndb and search-api cursors?
I'm answering from the ndb perspective; I haven't used the search API. All quotes are from your first link.
For 1 and 3 (ordering is considered part of the original query from the cursor's perspective):
To retrieve additional results from the point of the cursor, the
application prepares a similar query with the same entity kind,
filters, and sort orders, and passes the cursor to the query's
with_cursor() method before performing the retrieval
So it doesn't really matter how old the cursor is (i.e. how old its query is), since the original query must be reconstructed anyway for the cursor to be applied.
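For illustration, a minimal ndb sketch of resuming a query from a stored cursor (the model and names are made up):

    from google.appengine.datastore.datastore_query import Cursor
    from google.appengine.ext import ndb

    class Article(ndb.Model):  # hypothetical model
        created = ndb.DateTimeProperty(auto_now_add=True)

    def fetch_articles_page(urlsafe_cursor=None):
        # Rebuild the query with the same kind, filters and sort orders
        # as the one that originally produced the cursor.
        query = Article.query().order(-Article.created)
        cursor = Cursor(urlsafe=urlsafe_cursor) if urlsafe_cursor else None
        articles, next_cursor, more = query.fetch_page(20, start_cursor=cursor)
        # next_cursor.urlsafe() can be stored (e.g. in a page URL) and fed
        # back into this function later - no matter how much later.
        return articles, next_cursor.urlsafe() if next_cursor else None, more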
For 2:
Cursors and data updates
The cursor's position is defined as the location in the result list
after the last result returned. A cursor is not a relative position in
the list (it's not an offset); it's a marker to which Cloud Datastore
can jump when starting an index scan for results. If the results for a
query change between uses of a cursor, the query notices only changes
that occur in results after the cursor. If a new result appears before
the cursor's position for the query, it will not be returned when the
results after the cursor are fetched. Similarly, if an entity is no
longer a result for a query but had appeared before the cursor, the
results that appear after the cursor do not change. If the last result
returned is removed from the result set, the cursor still knows how to
locate the next result.
When retrieving query results, you can use both a start cursor and an
end cursor to return a continuous group of results from Cloud
Datastore. When using a start and end cursor to retrieve the results,
you are not guaranteed that the size of the results will be the same
as when you generated the cursors. Entities may be added or deleted
from Cloud Datastore between the time the cursors are generated and
when they are used in a query.
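With ndb, for example, both cursors go into the same fetch call (a sketch reusing the hypothetical Article model from above; saved_start and saved_end stand for urlsafe cursor strings captured earlier):

    from google.appengine.datastore.datastore_query import Cursor

    def fetch_between(saved_start, saved_end):
        # The slice reflects the current state of the data, so its size
        # may differ from when the two cursors were generated.
        return Article.query().order(-Article.created).fetch(
            start_cursor=Cursor(urlsafe=saved_start),
            end_cursor=Cursor(urlsafe=saved_end))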
The Java equivalent page at Limitations of cursors mentions some errors that can be raised for inconsistencies:
New App Engine releases might change internal implementation details,
invalidating cursors that depend on them. If an application attempts
to use a cursor that is no longer valid, Cloud Datastore raises an
IllegalArgumentException (low-level API), JDOFatalUserException
(JDO), or PersistenceException (JPA).
I suspect Python raises similar errors as well.
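Something along these lines, presumably (an untested sketch continuing the one above; the exact exception type is my assumption):

    from google.appengine.api import datastore_errors

    try:
        results, next_cursor, more = query.fetch_page(20, start_cursor=cursor)
    except datastore_errors.BadRequestError:
        # Cursor invalidated by an internal change: fall back to
        # restarting the query from the beginning.
        results, next_cursor, more = query.fetch_page(20)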
I have a question about getting a Cursor.
Target function:
https://godoc.org/google.golang.org/appengine/datastore#Iterator.Cursor
As far as I can tell from the following code, an offset is set when getting a Cursor:
https://github.com/golang/appengine/blob/master/datastore/query.go#L702-L705
When I executed this function and checked the stack trace in the GCP console, Insights displayed a warning:
Issue: Use of offset in datastore queries.
Description: Your app made 1 remote procedure call to datastore.query() and datastore.next() using offset.
Recommendation: Use cursor instead of offset.
Query Details
g.co/gae/datastore/offset 10
g.co/gae/datastore/skipped 10
Offset affects performance and billing, so I want to avoid this behavior.
Is there a way to avoid using offset? Or is this the correct behavior?
From Offsets versus cursors:
Although Cloud Datastore supports integer offsets, you should avoid
using them. Instead, use cursors. Using an offset only avoids
returning the skipped entities to your application, but these entities
are still retrieved internally. The skipped entities do affect the
latency of the query, and your application is billed for the read
operations required to retrieve them. Using cursors instead of offsets
lets you avoid all these costs.
The q.offset you're referring to is an internal variable used in the Cursor implementation; it's not the explicit query offset that the quote above mentions.
So you should be fine using Cursor.
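To illustrate the difference in Python terms (the Go flow with Iterator.Cursor() is analogous; SomeModel is a placeholder):

    from google.appengine.ext import ndb

    class SomeModel(ndb.Model):  # placeholder kind
        pass

    # Anti-pattern: the 10 skipped entities are still read (and billed).
    page2 = SomeModel.query().fetch(10, offset=10)

    # Cursor-based: the second fetch jumps straight to the saved index
    # position, so nothing is skipped (or billed) server-side.
    page1, cursor, more = SomeModel.query().fetch_page(10)
    page2, cursor, more = SomeModel.query().fetch_page(10, start_cursor=cursor)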
I understand that there are limitations with App Engine's datastore cursors, and I am curious how people manage to retrieve full result sets under these limitations.
My scenario is that I need to repeatedly run a query with both the OR operator and NOT_EQUAL. However, since the cursor is null for such queries, I cannot retrieve the next set of records.
P.S. I am using Objectify as well, but haven't found any documentation whether Objectify has a workaround.
Thanks!
For queries with NOT_EQUAL you can drop that particular element from the query to make it cursor-capable, and implement the equivalent check in the result-processing code instead (i.e. skip an entity if the corresponding EQUAL condition holds).
To address the OR limitation you can perform a separate cursor-capable query for each of the OR branches and make the result-processing code idempotent by either:
tracking or flagging the processed entities, to ensure that entities appearing in more than one query's results are only processed once
having the processing code produce the same result even if an entity is processed multiple times
The 2 techniques can be combined if needed - as in your case.
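For illustration, a rough sketch of the combined approach in Python/ndb (an Objectify version would have the same shape; the model and property names are made up):

    from google.appengine.ext import ndb

    class Item(ndb.Model):  # hypothetical model
        color = ndb.StringProperty()
        status = ndb.StringProperty()

    def process_matching(process):
        # Intended query: color == 'red' OR color == 'blue', status != 'done'.
        # Each branch below is a plain equality query, so it stays
        # cursor-capable.
        seen = set()
        for color in ('red', 'blue'):  # one query per OR branch
            query = Item.query(Item.color == color)
            cursor, more = None, True
            while more:
                items, cursor, more = query.fetch_page(100, start_cursor=cursor)
                for item in items:
                    if item.status == 'done':  # the NOT_EQUAL check, in code
                        continue
                    if item.key in seen:       # dedupe across OR branches
                        continue
                    seen.add(item.key)
                    process(item)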
Of course, they are neither as convenient nor as efficient as without the cursor limitation ;)
In my application I have a datastore query with a filter, such as:
datastore.NewQuery("sometype").Filter("SomeField<", 10)
I'm using a cursor to iterate over batches of the result (e.g. in different tasks). If the value of SomeField is changed while iterating, the cursor no longer works on Google App Engine (it works fine on devappserver).
I have a test project here: https://github.com/fredr/appenginetest
In my test I ran /db, which sets up the datastore with 10 items whose values are 0, then ran /run/2, which iterates over all items where the value is less than 2, in batches of 5, and updates the value of each item to 2.
The result on my local devappserver: all items are updated.
The result on appengine: only five items are updated.
Am I doing something wrong? Is this a bug? Or is this the expected result?
In the documentation it states:
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values.
The problem is the nature and implementation of cursors. A cursor contains the (encoded) key of the last processed entity, so if you set a cursor on your query before executing it, the Datastore will jump to the entity specified by the key encoded in the cursor and start listing entities from that point.
Let's examine your case:
Your query filter is Value<2. You iterate over the entities of the query result, and you change (and save) the Value property to 2. Note that Value=2 does not satisfy the filter Value<2.
In the next iteration (next batch) a cursor is present, which you apply properly. When the Datastore executes the query, it jumps to the last entity processed in the previous iteration and wants to list the entities that come after it. But the entity pointed to by the cursor may no longer satisfy the filter, because the index entry for its new Value of 2 will most likely have been updated already (this is non-deterministic - see eventual consistency for details; it applies here because you did not use an ancestor query, which would guarantee strongly consistent results; the time.Sleep() delay just increases the probability of the index update having happened).
So the Datastore sees that the last processed entity does not satisfy the filter. It will not search all the entities again, but will report that no more entities match the filter; hence no more entities are updated (and no errors are reported).
Suggestion: don't combine cursors with a filter or sort order on the same property you are updating during the iteration.
By the way:
The part from the Appengine docs you quoted:
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values.
This is not what you think. It means: cursors may not work properly on a property which has multiple values AND that same property is either included in an inequality filter or used to sort the results.
By the way #2
In the screenshot you are using SDK 1.9.17. The latest SDK version is 1.9.21. You should update it and always use the latest available version.
Alternatives to achieve your goal
1) Don't use cursors
If you have many records you won't be able to update all your entities in one step (one loop), but suppose you update 300 entities per pass. If you then repeat the query, the already-updated entities will not be in the results, because the updated Value=2 no longer satisfies the filter Value<2. Just redo the query+update until the query returns no results. Since your change is idempotent, it causes no harm if the update of an entity's index entry is delayed and the entity is returned by the query multiple times. It is best to delay the next query a little to minimize the chance of this (e.g. wait a few seconds between passes), as shown in the sketch after the pros/cons below.
Pros: Simple. You already have the solution, just exclude the cursor handling part.
Cons: Some entities might get updated multiple times (therefore the change must be idempotent). Also, the change performed on an entity must be one that excludes it from the next run of the query.
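A rough sketch of this loop (in Python/ndb for brevity, since the idea is language-agnostic; the model is a stand-in for your kind):

    import time

    from google.appengine.ext import ndb

    class Item(ndb.Model):  # hypothetical stand-in
        value = ndb.IntegerProperty()

    def update_all():
        while True:
            # No cursor: each pass re-runs the query from scratch. Updated
            # entities no longer match value < 2, so they drop out.
            batch = Item.query(Item.value < 2).fetch(300)
            if not batch:
                break               # nothing left to update
            for entity in batch:
                entity.value = 2    # idempotent change
            ndb.put_multi(batch)
            time.sleep(2)           # let index updates catch up a bit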
2) Using Task Queue
You could first execute a keys-only query and defer the updates using tasks. You could create tasks that each receive, say, 100 keys; each task loads its entities by key and performs the update. This ensures each entity gets updated only once. This solution has a slightly bigger delay due to involving the task queue, but that is not a problem in most cases (see the sketch after the pros/cons below).
Pros: No duplicated updates (therefore change may be non-idempotent). Works even if the change to be performed would not exclude the entity from the next query (more general).
Cons: Higher complexity. Bigger lag/delay.
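A sketch of this variant, again in Python/ndb using the deferred library (names are made up; in Go you would enqueue tasks via the taskqueue package instead):

    from google.appengine.ext import deferred, ndb

    class Item(ndb.Model):  # hypothetical stand-in
        value = ndb.IntegerProperty()

    def update_batch(urlsafe_keys):
        # Runs inside a task: load the entities by key and update each
        # exactly once, regardless of index staleness.
        keys = [ndb.Key(urlsafe=k) for k in urlsafe_keys]
        entities = [e for e in ndb.get_multi(keys) if e is not None]
        for entity in entities:
            entity.value = 2
        ndb.put_multi(entities)

    def kick_off():
        keys = Item.query(Item.value < 2).fetch(keys_only=True)
        for i in range(0, len(keys), 100):  # 100 keys per task
            deferred.defer(update_batch,
                           [k.urlsafe() for k in keys[i:i + 100]])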
3) Using Map-Reduce
You could use the map-reduce framework/utility to do massively parallel processing of many entities. Not sure if it has been implemented in Go.
Pros: Parallel execution, can handle even millions or billions of entities. Much faster in case of large entity number. Plus pros listed at 2) Using Task Queue.
Cons: Higher complexity. Might not be available in Go yet.
In the App Engine Documentation I found an interesting strategy for keeping up to date with changes in the datastore by using Cursors:
An interesting application of cursors is to monitor entities for unseen changes. If the app sets a timestamp property with the current date and time every time an entity changes, the app can use a query sorted by the timestamp property, ascending, with a Datastore cursor to check when entities are moved to the end of the result list. If an entity's timestamp is updated, the query with the cursor returns the updated entity. If no entities were updated since the last time the query was performed, no results are returned, and the cursor does not move.
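For concreteness, here is roughly the pattern I understand this to describe (an ndb sketch; the model is made up):

    from google.appengine.datastore.datastore_query import Cursor
    from google.appengine.ext import ndb

    class Record(ndb.Model):  # hypothetical model
        updated = ndb.DateTimeProperty(auto_now=True)  # set on every change

    def poll_changes(urlsafe_cursor=None):
        # Ascending timestamp order: changed entities move to the end of
        # the result list, i.e. past the stored cursor.
        query = Record.query().order(Record.updated)
        cursor = Cursor(urlsafe=urlsafe_cursor) if urlsafe_cursor else None
        changed, next_cursor, _ = query.fetch_page(100, start_cursor=cursor)
        # If nothing changed, the result list is empty and the cursor
        # stays put.
        new_cursor = next_cursor.urlsafe() if next_cursor else urlsafe_cursor
        return changed, new_cursor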
However, I'm not quite sure how this can always work. After all, when using the High Replication Datastore, queries are only eventually consistent. So if two entities are put, and only the later of the two is seen by the query, the cursor will move past both of them, which means the first of the two new entities will remain unseen.
So is this an actual issue? Or is there some other way that cursors work around this?
Having an index, built-in or composite, on a property that contains a monotonically increasing value (such as the current timestamp) may not perform as well as you'd like at high write rates. This type of workload generates a hotspot, as the tail of the index is constantly being updated, instead of the load being distributed throughout the sorted index. At low write rates, however, this works fine.
The rest of the answer will depend on whether you are in the same entity group or separate entity groups.
If your query is an ancestor query, and is thus confined to a single entity group, it can be strongly consistent (ancestor queries are by default), and the described method should always be accurate. The query will immediately see any writes (changes to entities inside the entity group).
If you are querying over many entity groups, the query is always eventually consistent, and there is no guarantee of the order in which writes are applied/visible. For example:
- Time1 - Write EntityA
- Time2 - Write EntityB
- Time3 - Query only sees EntityB
- Time4 - Query sees EntityA and EntityB
So the method of using a cursor to detect a change is correct, but it may "skip" over some changes.
For more information on eventual/strong consistency, see Balancing Strong and Eventual consistency with Google Cloud Datastore
You'd probably be best informed by asking someone who has worked on it, but after thinking about it a bit and re-reading up on Paxos, I think it should not be a problem, though it depends on how the Datastore is actually implemented.
A cursor is essentially a position in the index. In theory you can re-read the same cursor over and over, and see new entities start appearing after it. In the real world case, you'll generally move on to the newest cursor position and forget about the old cursor position.
Eventual consistency "problems" appear because there's multiple copies of the index spread across multiple machines. Depending on which index you read from, you may get stale results.
You describe a problem case where there are two (exact) copies of an index I, and two new entities are created, E1 and E2. Say I1 = I + E1 and I2 = I + E2; depending on the index you read from, you might get E1 or E2 as the new entity, move your cursor, and miss an entity when the index gets "patched" with the other index, i.e. I2 eventually gets patched to I + E1 + E2.
If the datastore actually works that way, then I suspect, yes, you could get a problem. However, it sounds very difficult to operate that way, and I suspect the datastore indexes only get updated after the Paxos voting comes to an agreement. So you'll never see an out-of-order index, you'll only see entities show up late: i.e., you'll never see I + E2; you'll only ever see I, or I + E1, or I + E1 + E2.
I suspect, though, that you might hit a case where a cursor is too new for a replica's index that hasn't caught up yet.
When using the google app engine search API, if we have a query that returns a large result set (>1000), and need to iterate using the cursor to collect the entire result set, we are getting indeterminate results for the documents returned if the number_found_accuracy is lower than our result size.
In other words, the same query run twice, collecting all the documents via cursors, does not return the same documents, UNLESS our number_found_accuracy is higher than the result size (e.g. using the 10000 max). Only then do we always get the same documents.
Our understanding of how number_found_accuracy is supposed to work is that it should only affect the number_found estimate. We assumed that if you use the cursor to get all the results, you would get the same results as if you had run one large query.
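To be concrete, our collection loop is essentially the following (a simplified sketch; index construction elided):

    from google.appengine.api import search

    def collect_all(index, query_string, accuracy=None):
        # Page through the full result set via continuation cursors.
        docs = []
        cursor = search.Cursor()  # empty cursor: start from the beginning
        while cursor is not None:
            options = search.QueryOptions(
                limit=200,
                cursor=cursor,
                number_found_accuracy=accuracy)
            results = index.search(search.Query(query_string, options=options))
            docs.extend(results.results)
            cursor = results.cursor  # None once the result set is exhausted
        return docs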
Are we misunderstanding the use of number_found_accuracy or cursors, or have we found a bug?
Your understanding of number_found_accuracy is correct. I think that the behavior you're observing is the surprising interplay between replicated query failover and how queries that specify number_found_accuracy affect future queries using continuation tokens.
When you index documents using the Search API, we store them in several distinct replicas using the same mechanism as the High Replication Datastore (i.e., Megastore). How quickly those documents become live on each replica depends on many factors. It's usually immediate, but the delay can become much longer if you're doing batch writes to a single (index, namespace) pair.
Searches can get executed on any of these replicas. We'll even potentially run a search that uses a continuation token on a different replica than the original search. If the original replica and/or continuation replica are catching up on their indexing work, they might have different sets of live documents. It will become consistent "eventually" but it's not always immediate.
If you specify number_found_accuracy on a query, we have to run most of the query as if we're going to return number_found_accuracy results. We specifically have to read much further down the posting lists to find and count matching documents. Reading a posting list results in its associated file block being inserted into various caches.
In turn, when you do the search using a cursor we'll be able to read the document for real much more quickly on the same replica that we'd used for the original search. You're thus less likely to have the continuation search failover to a different replica that might not have finished indexing the same set of documents. I think that the inconsistent results you've observed are the result of this kind of continuation query failover.
In summary, setting number_found_accuracy to something large effectively prewarms that replica's cache. It will thus almost certainly be the fastest replica for a continuation search. In the face of replicas that are trying to catch up on indexing, that will give the appearance that number_found_accuracy has a direct effect on the consistency of results, but in reality it's just a side-effect.