Need ideas to manage query results (cursor) with inequality operations

I understand that there's a limitation with App Engine's datastore cursors, and I am curious how people manage to retrieve result sets under this limitation.
My scenario is that I need to run a query that uses both the OR operator and NOT_EQUAL, multiple times. However, since the returned cursor is null, I cannot retrieve the next set of records.
P.S. I am using Objectify as well, but I haven't found any documentation on whether Objectify has a workaround.
Thanks!

For queries with NOT_EQUAL you can drop that particular filter from the query to make it cursor-capable, and implement the equivalent check in the result-entity processing code instead (i.e. skip an entity if the corresponding EQUAL condition holds).
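For illustration, a minimal sketch in Go (the asker uses Objectify/Java, but the idea is the same in any client; the Record kind, its Status property, and the "archived" value are made up):

    package example

    import (
        "context"

        "google.golang.org/appengine/datastore"
    )

    type Record struct {
        Status string
    }

    // processBatch runs one cursor-capable batch of the query with the
    // NOT_EQUAL condition removed, enforcing the check in code instead.
    // It returns the cursor to pass to the next batch.
    func processBatch(ctx context.Context, start string) (string, error) {
        q := datastore.NewQuery("Record").Limit(20)
        if start != "" {
            c, err := datastore.DecodeCursor(start)
            if err != nil {
                return "", err
            }
            q = q.Start(c)
        }
        it := q.Run(ctx)
        for {
            var r Record
            _, err := it.Next(&r)
            if err == datastore.Done {
                break
            }
            if err != nil {
                return "", err
            }
            if r.Status == "archived" { // the dropped NOT_EQUAL "archived" filter
                continue // skip: the EQUAL condition holds
            }
            // ... process r ...
        }
        c, err := it.Cursor()
        if err != nil {
            return "", err
        }
        return c.String(), nil
    }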
To address the OR limitation you can perform a separate cursor-capable query for each of the OR branches and make the result-entity processing code idempotent by either:
tracking or flagging the processed entities, to ensure that entities appearing in more than one of the separate query results are only processed once, or
having the processing code produce the same result even if an entity is processed multiple times.
The two techniques can be combined if needed, as in your case.
Of course, they are neither as convenient nor as efficient as a single query without the cursor limitation would be ;) A sketch of the multiple-query technique follows.
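A minimal Go sketch of the multiple-query technique, de-duplicating entities by their encoded keys (the Record kind and the branch values are hypothetical; per-branch cursors can be added exactly as in the previous sketch):

    package example

    import (
        "context"

        "google.golang.org/appengine/datastore"
    )

    type Record struct {
        Status string
    }

    // processOr emulates `Status = "new" OR Status = "retry"` with one
    // query per branch, processing each entity only once.
    func processOr(ctx context.Context) error {
        seen := map[string]bool{} // encoded keys already processed
        for _, status := range []string{"new", "retry"} { // the OR branches
            it := datastore.NewQuery("Record").Filter("Status =", status).Run(ctx)
            for {
                var r Record
                key, err := it.Next(&r)
                if err == datastore.Done {
                    break
                }
                if err != nil {
                    return err
                }
                if seen[key.Encode()] {
                    continue // already seen via another branch
                }
                seen[key.Encode()] = true
                // ... process r exactly once ...
            }
        }
        return nil
    }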

Related

GAE Push Queue database contention during datastore query

Summary
I have an issue where the database writes from my task queue (approximately 60 tasks, at 10/s) are somehow being overwritten/discarded during a concurrent database read of the same data. I will explain how it works. Each task in the task queue assigns a unique ID to a specific datastore entity of a model.
If I run an indexed datastore query on the model and loop through the entities while the task queue is in progress, I would expect that some of the entities will have been operated on by the task queue (i.e. assigned an ID) while others are yet to be affected. Unfortunately, what seems to be happening is that during the loop through the query results, entities that were already operated on (i.e. successfully assigned an ID) are being overwritten or discarded, appearing as if they were never operated on, even though, according to my logs, they were.
Why is this happening? I need to be able to read the status of my data without affecting the task queue's write operations in the background. I thought maybe it was a caching issue, so I tried enforcing use_cache=False and use_memcache=False on the query, but that did not solve the issue. Any help would be appreciated.
Other interesting notes:
If I allow the task queue to complete fully before running a datastore query, everything acts as expected and nothing is overwritten/discarded.
This is typically an indication that the write operations on the entities are not performed in transactions. Transactions can detect such concurrent write (and read!) operations and retry them, ensuring that the data remains consistent.
You also need to be aware that queries (unless they are ancestor queries) are eventually consistent: their results can be a bit "behind" the actual datastore content, because it takes some time from the moment an entity is updated until the indexes the queries use are updated accordingly. So when processing entities from query results you should also transactionally verify their content. Personally, I prefer to make keys_only queries and then obtain the entities via key lookups, which are always consistent (also done in transactions if I intend to update the entities and, on reads, if needed).
For example, if you query for entities which don't yet have a unique ID, you may get entities which were in fact recently operated on and do have an ID. So you should (transactionally) check whether the entity actually has an ID and skip the update if it does, as sketched below.
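A minimal sketch of this pattern in Go (the question uses Python/ndb, but the pattern is the same; the Item kind, AssignedID property, and nextID helper are hypothetical):

    package example

    import (
        "context"

        "google.golang.org/appengine/datastore"
    )

    type Item struct {
        AssignedID int64
    }

    func nextID() int64 { return 42 } // hypothetical unique-ID generator

    // assignIDs runs a keys-only query, then verifies and updates each
    // entity inside a transaction, so a concurrent task-queue write can
    // neither be clobbered nor repeated.
    func assignIDs(ctx context.Context) error {
        keys, err := datastore.NewQuery("Item").
            Filter("AssignedID =", 0).KeysOnly().GetAll(ctx, nil)
        if err != nil {
            return err
        }
        for _, key := range keys {
            err := datastore.RunInTransaction(ctx, func(tc context.Context) error {
                var item Item
                if err := datastore.Get(tc, key, &item); err != nil {
                    return err
                }
                if item.AssignedID != 0 {
                    return nil // a task already assigned an ID; skip the update
                }
                item.AssignedID = nextID()
                _, err := datastore.Put(tc, key, &item)
                return err
            }, nil)
            if err != nil {
                return err
            }
        }
        return nil
    }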
Also make sure you're not updating entities obtained from projection queries; the results of such queries may not contain all of an entity's properties, so writing them back will wipe out the properties not included in the projection.

Google app engine datastore query with cursor won't iterate all items

In my application I have a datastore query with a filter, such as:
datastore.NewQuery("sometype").Filter("SomeField<", 10)
I'm using a cursor to iterate over batches of the result (e.g. in different tasks). If the value of SomeField is changed while iterating, the cursor no longer works on Google App Engine (it works fine on the devappserver).
I have a test project here: https://github.com/fredr/appenginetest
In my test I ran /db, which sets up the datastore with 10 items with their values set to 0, then ran /run/2, which iterates over all items where the value is less than 2, in batches of 5, and updates the value of each item to 2.
The result on my local devappserver: all items are updated.
The result on App Engine: only five items are updated.
Am I doing something wrong? Is this a bug? Or is this the expected result?
In the documentation it states:
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values.
The problem is the nature and implementation of cursors. A cursor contains the (encoded) key of the last processed entity, so if you set a cursor on your query before executing it, the Datastore will jump to the entity specified by the cursor and start listing entities from that point.
Let's examine your case:
Your query filter is Value<2. You iterate over the entities of the query result and you change (and save) the Value property to 2. Note that Value=2 no longer satisfies the filter Value<2.
In the next iteration (the next batch) a cursor is present, which you apply properly. So when the Datastore executes the query, it jumps to the last entity processed in the previous iteration and wants to list entities that come after it. But the entity pointed to by the cursor may no longer satisfy the filter, because the index entry for its new Value, 2, will most likely have been updated already. (This is non-deterministic behavior; see eventual consistency for details. It applies here because you did not use an ancestor query, which would guarantee strongly consistent results; the time.Sleep() delay just increases the probability that the index has already been updated.)
So the Datastore sees that the last processed entity no longer satisfies the filter. It will not search all the entities again, but simply reports that no more entities match the filter, hence no more entities get updated (and no errors will be reported). The sketch below shows the failing pattern in compact form.
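A condensed, hypothetical Go version of the failing pattern (the kind name follows the question's generic example; the Value property and batch size follow the discussion above):

    package example

    import (
        "context"

        "google.golang.org/appengine/datastore"
    )

    type entity struct {
        Value int
    }

    // updateBatch is the pattern that fails on production App Engine:
    // it updates the very property the query filters on, so the cursor
    // ends up positioned at an entity that no longer matches the filter.
    func updateBatch(ctx context.Context, start string) (string, error) {
        q := datastore.NewQuery("sometype").Filter("Value <", 2).Limit(5)
        if start != "" {
            c, err := datastore.DecodeCursor(start)
            if err != nil {
                return "", err
            }
            q = q.Start(c)
        }
        it := q.Run(ctx)
        for {
            var e entity
            key, err := it.Next(&e)
            if err == datastore.Done {
                break
            }
            if err != nil {
                return "", err
            }
            e.Value = 2 // invalidates the query's own filter
            if _, err := datastore.Put(ctx, key, &e); err != nil {
                return "", err
            }
        }
        c, err := it.Cursor() // points past an entity that no longer matches
        if err != nil {
            return "", err
        }
        return c.String(), nil
    }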
Suggestion: don't use cursors with a query that filters or sorts by the same property you are updating.
By the way:
The part from the Appengine docs you quoted:
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values.
This is not what you think. It means: cursors may not work properly when a property has multiple values AND that same property is either included in an inequality filter or used to sort the results.
By the way #2
In the screenshot you are using SDK 1.9.17. The latest SDK version is 1.9.21. You should update it and always use the latest available version.
Alternatives to achieve your goal
1) Don't use cursors
If you have many records, you won't be able to update all your entities in one pass (one loop), but say you update 300 entities per pass. If you then repeat the query, the already-updated entities will not be in the results, because the updated Value=2 does not satisfy the filter Value<2. Just redo the query+update until the query returns no results (see the sketch after the pros and cons). Since your change is idempotent, it would cause no harm if the update of an entity's index entry were delayed and the entity were returned by the query multiple times; it is best to delay the next query a little to minimize the chance of this (e.g. wait a few seconds between queries).
Pros: Simple. You already have the solution, just exclude the cursor handling part.
Cons: Some entities might get updated multiple times (therefore the change must be idempotent). Also the change performed on entities must be something which will exclude the entity from the next query.
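A minimal Go sketch of this alternative, under the question's assumptions (kind sometype, filter Value<2; the batch size and sleep are arbitrary):

    package example

    import (
        "context"
        "time"

        "google.golang.org/appengine/datastore"
    )

    type entity struct {
        Value int
    }

    // updateAll repeats the (cursor-less) query until nothing matches.
    // The update itself removes entities from future results, and it is
    // idempotent, so occasional re-processing is harmless.
    func updateAll(ctx context.Context) error {
        for {
            var items []entity
            keys, err := datastore.NewQuery("sometype").
                Filter("Value <", 2).Limit(300).GetAll(ctx, &items)
            if err != nil {
                return err
            }
            if len(keys) == 0 {
                return nil // done: no entity satisfies the filter anymore
            }
            for i := range items {
                items[i].Value = 2
            }
            if _, err := datastore.PutMulti(ctx, keys, items); err != nil {
                return err
            }
            time.Sleep(3 * time.Second) // let the indexes catch up a bit
        }
    }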
2) Using Task Queue
You could first execute a keys-only query and defer the updates to tasks. You could create tasks passing, say, 100 keys to each; the tasks then load the entities by key and do the update. This ensures each entity gets updated only once (see the sketch after the pros and cons). This solution has a slightly bigger delay due to involving the task queue, but that is not a problem in most cases.
Pros: No duplicated updates (therefore the change may be non-idempotent). Works even if the change performed would not exclude the entity from the next query (more general).
Cons: Higher complexity. Bigger lag/delay.
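A Go sketch of the fan-out; the /tasks/update handler URL and the update-queue name are assumed:

    package example

    import (
        "context"
        "net/url"

        "google.golang.org/appengine/datastore"
        "google.golang.org/appengine/taskqueue"
    )

    // enqueueUpdates fans the matching keys out to tasks of 100 keys each.
    func enqueueUpdates(ctx context.Context) error {
        keys, err := datastore.NewQuery("sometype").
            Filter("Value <", 2).KeysOnly().GetAll(ctx, nil)
        if err != nil {
            return err
        }
        for start := 0; start < len(keys); start += 100 {
            end := start + 100
            if end > len(keys) {
                end = len(keys)
            }
            params := url.Values{}
            for _, k := range keys[start:end] {
                params.Add("key", k.Encode())
            }
            // "/tasks/update" and "update-queue" are assumed names.
            t := taskqueue.NewPOSTTask("/tasks/update", params)
            if _, err := taskqueue.Add(ctx, t, "update-queue"); err != nil {
                return err
            }
        }
        return nil
    }

The task handler would decode each key with datastore.DecodeKey and apply the update, possibly in a transaction as discussed earlier.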
3) Using Map-Reduce
You could use the map-reduce framework/utility to do massively parallel processing of many entities. Not sure if it has been implemented in Go.
Pros: Parallel execution; can handle even millions or billions of entities. Much faster for large numbers of entities. Plus the pros listed at 2) Using Task Queue.
Cons: Higher complexity. Might not be available in Go yet.

Number Found Accuracy on Search API Affecting Cursor Results

When using the Google App Engine Search API, if we have a query that returns a large result set (>1000) and we need to iterate using the cursor to collect the entire result set, we get indeterminate results for the documents returned if the number_found_accuracy is lower than our result size.
In other words, the same query run twice, collecting all the documents via cursors, does not return the same documents, UNLESS our number_found_accuracy is higher than the result size (e.g. using the 10000 maximum). Only then do we always get the same documents.
Our understanding of how number_found_accuracy is supposed to work is that it only affects the number_found estimate. We assumed that if you use the cursor to get all the results, you would get the same results as if you had run one large query.
Are we misunderstanding the use of number_found_accuracy or cursors, or have we found a bug?
Your understanding of number_found_accuracy is correct. I think that the behavior you're observing is the surprising interplay between replicated query failover and how queries that specify number_found_accuracy affect future queries using continuation tokens.
When you index documents using the Search API, we store them in several distinct replicas using the same mechanism as the High Replication Datastore (i.e., Megastore). How quickly those documents become live on each replica depends on many factors. It's usually immediate, but the delay can become much longer if you're doing batch writes to a single (index, namespace) pair.
Searches can get executed on any of these replicas. We'll even potentially run a search that uses a continuation token on a different replica than the original search. If the original replica and/or continuation replica are catching up on their indexing work, they might have different sets of live documents. It will become consistent "eventually" but it's not always immediate.
If you specify number_found_accuracy on a query, we have to run most of the query as if we're going to return number_found_accuracy results. We specifically have to read much further down the posting lists to find and count matching documents. Reading a posting list results in its associated file block being inserted into various caches.
In turn, when you do the search using a cursor we'll be able to read the document for real much more quickly on the same replica that we'd used for the original search. You're thus less likely to have the continuation search failover to a different replica that might not have finished indexing the same set of documents. I think that the inconsistent results you've observed are the result of this kind of continuation query failover.
In summary, setting number_found_accuracy to something large effectively prewarms that replica's cache. It will thus almost certainly be the fastest replica for a continuation search. In the face of replicas that are trying to catch up on indexing, that will give the appearance that number_found_accuracy has a direct effect on the consistency of results, but in reality it's just a side-effect.

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at Google I/O:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of, say, 10, then transactionally updating the materialized-view entity), and re-enqueue itself if it times out before working through all the markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? I.e., is it better to query for "markers with a given work_index" than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably by assigning another field on your model to a worker's shard_number at random; then your query would be "shard_number=N AND applied=False" (requiring another composite index). Okay, that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work-item write rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. The resulting Bigtable index rows for your entities look something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22", which means every work-item insert is hitting the same Bigtable tablet server: the server that currently owns the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, the only Bigtable index row is prefixed by a hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work-item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach, the write rate should effectively be bound only by the number of physical Bigtable machines in a cluster. A sketch of the scattering follows.
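A minimal Go sketch of the scattering idea (the real implementation is in the linked fork_join_queue.py; the hash choice and formatting here are assumptions):

    package example

    import (
        "encoding/binary"
        "fmt"
        "hash/fnv"
    )

    // workIndex maps an incrementing sequence number to a value spread
    // across the lexical rowspace, so consecutive enqueues land on
    // different Bigtable tablet servers.
    func workIndex(seq uint64) string {
        h := fnv.New64a() // hash choice is an assumption
        var buf [8]byte
        binary.BigEndian.PutUint64(buf[:], seq)
        h.Write(buf[:])
        return fmt.Sprintf("%016x", h.Sum64()) // fixed width keeps ordering lexical
    }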
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index too, or you run into the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system the update would still get applied, but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!

What is SQL Server doing between the time my first record is returned and when my last record is returned?

Say I have a query that returns 10,000 records. When the first record is returned, what can I assume about the state of my query?
Has it finished executing and is it just returning records from the server to my instance of SSMS?
Is the query itself still being executed on the server?
What is it that causes the 10,000 records to be returned slowly for one query and nearly instantly for another?
There is potentially some mix of progressive processing on the server side, network transfer of the data, and rendering by the client.
If one query returns 10,000 rows quickly, and another one slowly -- and they are of similar row size, data types, etc., and are both destined for results to grid or results to text -- there is little we can do to analyze the differences unless you show us execution plans and/or client statistics for each one. These are options you can set in SSMS when running a query.
As an aside, switching between results to grid and results to text you might notice slightly different runtimes. This is because in one case Management Studio has to work harder to align the columns etc.
You cannot make a generic assumption; a query's plan is composed of a number of different types of operations, or iterators. Some of these are navigational and work like a pipeline, whilst others are set-based operations, such as a sort.
If a query contains a set-based operation, it requires all the records before it can output any results (e.g. an order by clause within your statement). If you have no set-based iterators, you can expect the rows to be streamed to you as they become available.
The answer to each of your individual questions is "it depends."
For example, consider if you include an order by clause and there isn't an index for the column(s) you're ordering by. In this case, the server has to find all the records that satisfy your query and then sort them before it can return the first record. This causes a long pause before you get your first record, but you should normally get the rest quite quickly once they start arriving.
Without the order by clause, the server will normally send each record as it's found, so the first record will often show up sooner, but you may see a long pause between one record and the next.
As far as simply "why is one query faster than another" goes, a lot depends on what indexes are available and whether they can be used for a particular query. For example, something like some_column like '%something' will almost always be quite slow: the leading '%' means this won't be able to use an index, even if some_column has one. A search for 'something%' instead of '%something' might easily be 100 or 1000 times faster. If you really need the leading wildcard, you really want to use full-text searching instead (create a full-text index, and use contains() instead of like).
Of course, a lot can also depend simply on whether the database has an index for a particular column (or group of columns). With a suitable index, the query will usually be quite a lot faster.
