ndb.OR makes queries cost more - google-app-engine

Using AppEngine appstats I profiled my queries, and noticed that although the docs say a query costs one read, queries using ndb.OR (or .IN which expands to OR), cost n reads (n equals the number of OR clauses).
For example:
votes = (Vote.query(ndb.OR(Vote.object == keys[0], Vote.object == keys[1]))
         .filter(Vote.user_id == user_id)
         .fetch(keys_only=True))
This query costs 2 reads even though it matches 0 entities. If I replace the ndb.OR with Vote.object.IN, the number of reads equals the length of the list I pass to IN.
This behavior seems to contradict the docs.
I was wondering whether anyone else has experienced the same thing, and whether this is a bug in App Engine, in the docs, or in my understanding.
Thanks.

The query docs for ndb are not particularly explicit, but this paragraph is your best answer:
In addition to the native operators, the API supports the != operator,
combining groups of filters using the Boolean OR operation, and the IN
operation, which test for equality to one of a list of possible values
(like Python's 'in' operator). These operations don't map 1:1 to the
Datastore's native operations; thus they are a little quirky and slow,
relatively. They are implemented using in-memory merging of result
streams. Note that p != v is implemented as "p < v OR p > v". (This
matters for repeated properties.)
It comes from this doc: https://developers.google.com/appengine/docs/python/ndb/queries
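The in-memory merging described in the quoted paragraph can be pictured with a small, self-contained Python sketch. It is a toy model and not ndb's actual implementation: the `ToyDatastore` class, its `run_equality_query` method and the `queries_issued` counter are hypothetical names. They only illustrate why an OR or IN over n values would issue n underlying queries, even when nothing matches.

```python
# Toy model (NOT ndb's real code): an OR/IN filter is expanded into one
# native equality query per value, and the result streams are merged in memory.

class ToyDatastore:
    """Hypothetical in-memory store that counts the native queries it runs."""

    def __init__(self, rows):
        self.rows = rows          # list of (key, object_id, user_id) tuples
        self.queries_issued = 0   # incremented once per native query

    def run_equality_query(self, object_id, user_id):
        self.queries_issued += 1
        return [key for key, obj, uid in self.rows
                if obj == object_id and uid == user_id]


def query_in(store, object_ids, user_id):
    """Emulate Vote.object.IN(object_ids) combined with Vote.user_id == user_id."""
    seen, merged = set(), []
    for object_id in object_ids:              # one sub-query per OR branch
        for key in store.run_equality_query(object_id, user_id):
            if key not in seen:               # de-duplicate while merging
                seen.add(key)
                merged.append(key)
    return merged


store = ToyDatastore([("v1", "a", 7), ("v2", "b", 7), ("v3", "c", 8)])
matches = query_in(store, ["x", "y"], user_id=7)
print(matches, store.queries_issued)
```

In this model the number of native queries depends only on the length of the value list, not on how many entities match, which is consistent with the appstats observation in the question.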

Related

Equivalent of DataSet groupBy/withPartitioner for DataStream

Previously with a DataSet I could do a .groupBy(...) followed by a .withPartitioner(...) to create groups such that one group (known to be much, much bigger than all the others) would be assigned to its own slot, and the other groups would be distributed among the remaining slots.
In switching to a DataStream, I don't see any straightforward way to do the same thing. If I dig into .keyBy(...), I see it using a PartitionTransformation with a KeyGroupStreamPartitioner, which is promising - but PartitionTransformation is an internal-use only class (or so the annotation says).
What's the recommended approach with a DataStream for achieving the same result?
With DataStream it's not as straightforward. You can implement a custom Partitioner and use it with partitionCustom, but then you do not have a KeyedStream, and so cannot use keyed state or timers.
Another solution is to do a two-step, local/global aggregation, e.g.,
.keyBy(randomizedKey).process(local).keyBy(key).process(global)
And in some cases, the first level of random keying isn't necessary (if the keys are already well distributed among the source partitions).
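The two-step idea can be illustrated outside Flink with a plain Python simulation. This is a sketch under assumptions, not Flink code: the function name `two_stage_count`, the `num_salts` parameter and the use of a (key, salt) pair as the first-level key are choices made for illustration only.

```python
# Toy simulation of local/global aggregation; plain Python, not Flink code.
import random
from collections import Counter


def two_stage_count(records, num_salts, seed=0):
    rng = random.Random(seed)
    # Stage 1 ("local"): key by (key, random salt) so that a hot key is
    # spread over up to num_salts partial aggregates.
    local = Counter()
    for key in records:
        local[(key, rng.randrange(num_salts))] += 1
    # Stage 2 ("global"): key by the original key and combine the partials.
    final = Counter()
    for (key, _salt), partial in local.items():
        final[key] += partial
    return local, final


records = ["hot"] * 1000 + ["a"] * 10 + ["b"] * 5
local, final = two_stage_count(records, num_salts=4)
print(dict(final))
print(max(local.values()))  # size of the largest partial aggregate
```

The final counts are the same as a single-stage count, but no first-stage group has to process every record of the hot key. The trade-off is that the aggregation must be decomposable into partial results that can be combined.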
In principle, given a priori knowledge of the hot keys, you should be able to somehow implement a KeySelector that does a good job of balancing the load among the task slots. I believe one or two people have actually done this (by brute force searching for a suitable mapping from original keys to actual keys), but I don't have a reference implementation at hand.
As David noted, you can sometimes do the double-keyBy trick (initially using a random key) to reduce the impact of key skew. In my case that wasn't viable, as I'm processing records in each group using a large deep learning network with significant memory requirements, which means having all models loaded at the same time for the first grouping.
I re-used a technique I'd gotten to work with an older version of Flink, where you decide which sub-task (operator index) should get each record, and then calculate a key that Flink will assign to the target sub-task. The code, which calculates an Integer key, looks something like:
// Requires: import org.apache.flink.runtime.state.KeyGroupRangeAssignment;
public static Integer makeKeyForOperatorIndex(int maxParallelism, int parallelism,
        int operatorIndex) {
    // Brute-force search: try candidate keys until one maps to the target sub-task.
    for (int i = 0; i < maxParallelism * 2; i++) {
        Integer key = Integer.valueOf(i);
        int index = KeyGroupRangeAssignment.assignKeyToParallelOperator(
                key, maxParallelism, parallelism);
        if (index == operatorIndex) {
            return key;
        }
    }
    throw new RuntimeException(String.format(
            "Unable to find key for target operator index %d (max parallelism = %d, parallelism = %d)",
            operatorIndex, maxParallelism, parallelism));
}
But note this is very fragile, as it depends on internal Flink implementation details.

Double negation in Lucene query

Let's say I have a binary field, checked.
Let's also assume that 3 documents out of 10 have checked:1 and the others have checked:0.
When I search in Lucene:
checked:1 - returns correct result (3)
checked:0 - returns correct result (7)
-checked:1 - returns correct result (7)
-checked:0 - returns correct result (3)
BUT
-(-(checked:1)) - suddenly returns wrong result (10, i.e. entire data set).
Any idea why the Lucene query parser behaves so strangely?
Each Lucene query has to contain at least one positive term (either MUST/+ or SHOULD) so it matches at least one document. So your queries -checked:1 and -checked:0 are invalid, and I am surprised you are getting any results.
These queries should (most likely) look like this:
+*:* -checked:1
+*:* -checked:0
Getting back to your problem: double negation makes no sense in Lucene. Why would you have double negation, what are you trying to query?
Generally speaking, don't look at Lucene query operators (! & |) as Boolean operators, they aren't exactly what you think they are.
After some research and trial and error, and building on the answer from midas, I have come up with a method to resolve this inconsistency. When I say inconsistency, I mean from a user's common-sense point of view. From an information retrieval perspective, midas has linked an interesting article, which explains why such a query makes no sense.
So, the trick is to pair each negative expression with a MatchAllDocsQueryNode; that is, the rewritten query has to look like this:
-(-(checked:1 *:*) *:*)
Then the query will produce the expected result. I have accomplished it by writing my own nodeprocessor class, which performs necessary operations.
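The effect of pairing each negative group with a match-all clause can be checked with a toy set-based model in Python. This is only a sketch and not Lucene code: the `evaluate` function and the nested-list query representation are hypothetical, only SHOULD and MUST_NOT clauses are modeled, and the assumption is that a group without any positive clause matches nothing, as the first answer describes.

```python
# Toy model of Lucene-style boolean matching; plain Python, not Lucene code.
MATCH_ALL = "*:*"


def evaluate(query, docs):
    """Return the set of matching doc IDs.

    query is MATCH_ALL, a "field:value" term, or a list of
    (occur, subquery) pairs with occur in {"should", "must_not"}.
    """
    if query == MATCH_ALL:
        return set(docs)
    if isinstance(query, str):
        field, value = query.split(":")
        return {doc_id for doc_id, fields in docs.items()
                if str(fields.get(field)) == value}
    positive, negative, has_positive = set(), set(), False
    for occur, sub in query:
        hits = evaluate(sub, docs)
        if occur == "must_not":
            negative |= hits
        else:
            positive |= hits
            has_positive = True
    # A group with no positive clause has nothing to subtract from.
    return positive - negative if has_positive else set()


# 10 documents, 3 of them with checked:1, as in the question.
docs = {i: {"checked": 1 if i < 3 else 0} for i in range(10)}

not_checked = [("should", MATCH_ALL), ("must_not", "checked:1")]
naive_double = [("should", MATCH_ALL),
                ("must_not", [("must_not", "checked:1")])]
fixed_double = [("should", MATCH_ALL), ("must_not", not_checked)]

for name, q in [("not_checked", not_checked),
                ("naive_double", naive_double),
                ("fixed_double", fixed_double)]:
    print(name, len(evaluate(q, docs)))
```

In this model, `naive_double` adds a match-all clause only at the top level. Its inner, purely negative group matches nothing, so the outer query returns all 10 documents. That is one plausible explanation for the result in the question, assuming the parser in use adds the match-all clause only at the top level. `fixed_double` pairs every negative group with a match-all clause and returns the 3 documents a user would expect.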

Order of fields in a lucene query

Does the order of fields matter in a lucene query?
For instance,
q = A && B && C
Let's say A appears in a million documents, B in 10000, C in 1000.
While the results would be identical irrespective of the order in which you AND
A, B and C, will the response times of the following queries differ in any way?
C && B && A
A && B && C
Does Lucene/Solr pick the best query execution plan in terms of both space and time for a given query?
It doesn't matter whether the query is A AND B AND C or C AND B AND A; the query execution time will be the same.
Also, if you do an AND, all the query terms need to be present for the document to be returned, so the document frequency would be the same.
However, the term frequency would differ and hence the score.
Lucene is "a high-performance full-featured text search engine library [...]" by definition.
Given the number of documents in which each term appears, it is easy to decide the order in which to perform the AND operations, and Lucene certainly does so.
If you are interested in the algorithm, the best performance is obtained by executing the AND starting with the terms of lowest cardinality and working up to the term with the highest.
In this way, thanks to the merge algorithm on the sorted posting lists [O(n+m), with n and m the lengths of the two posting lists] and to the skip pointers, you iterate over a smaller number of docIDs.
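The lowest-cardinality-first intersection can be sketched in Python as follows. This is an illustration of the idea and not Lucene's actual conjunction code: the `intersect` function is hypothetical, and `bisect_left` on a sorted list stands in for skip pointers.

```python
# Sketch of AND over sorted posting lists, smallest list first (not Lucene code).
from bisect import bisect_left


def intersect(postings):
    """Intersect sorted docID lists, starting from the shortest list."""
    ordered = sorted(postings, key=len)       # lowest cardinality first
    result = ordered[0]
    for plist in ordered[1:]:
        kept, pos = [], 0
        for doc_id in result:
            # bisect_left plays the role of a skip pointer: jump ahead in
            # the longer list instead of scanning it entry by entry.
            pos = bisect_left(plist, doc_id, pos)
            if pos == len(plist):
                break
            if plist[pos] == doc_id:
                kept.append(doc_id)
        result = kept
    return result


A = list(range(0, 10000))        # very common term
B = list(range(0, 10000, 10))    # less common term
C = list(range(0, 10000, 100))   # rare term

print(len(intersect([A, B, C])))
print(intersect([A, B, C]) == intersect([C, B, A]))
```

Because the lists are reordered by length before intersecting, the order in which the caller supplies A, B and C does not change the work done. The candidate set never grows beyond the size of the rarest term's list.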

Short-circuit OR operator in Lucene/Solr

I understand that lucene's AND (&&), OR (||) and NOT (!) operators are shorthands for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why one can't treat them as boolean operators (adhering to boolean algebra).
I have been trying to construct a simple OR expression, as follows
q = +(field1:value1 OR field2:value2)
with a match on either field1 or field2. But since OR merely marks a clause as optional, for documents where both field1:value1 and field2:value2 match, the query returns a score that reflects a match on both clauses.
How do I enforce short-circuiting in this context? In other words, how to implement short-circuiting as in boolean algebra where an expression A || B || C returns true if A is true without even looking into whether B or C could be true.
Strictly speaking, no, there is no short-circuiting boolean logic. If a document is found for one term, you can't simply tell Lucene not to check for the other. Lucene is an inverted index, so it doesn't really check documents for matches directly. If you search for A OR B, it looks up A in the index and gets all the documents which have indexed that value. Then it looks up B in the index and gets the list of all documents containing it (this is simplifying somewhat, but I hope it gets the point across). It doesn't really make sense for it to skip the B lookup for the documents in which A is found. Further, for the query provided, all the matches on a document still need to be enumerated in order to compute a correct score.
However, you did mention scores! I suspect what you are really trying to get at is that if one query term in a set is found, the score should not be compounded with the other elements. That is, for (A OR B), the score is either score-A or score-B, rather than score-A + score-B or some such (sorry if I am making a wrong assumption here, of course).
That is what DisjunctionMaxQuery is for. Adding each subquery to it will render a score equal to the maximum of the scores of all subqueries, rather than a combination of them.
In Solr, you should learn about the DisMaxQParserPlugin and its more recent incarnation, the ExtendedDisMax, which, if I'm close to the mark here, should serve you very well.
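The difference between the two scoring styles can be shown with a few lines of Python. This is a sketch of the arithmetic only and not Lucene code: the function names are hypothetical, and the tie-breaker term reflects the documented behavior of DisjunctionMaxQuery, where the best score is taken and the remaining scores are optionally added with a small multiplier.

```python
# Arithmetic sketch only (not Lucene code): combining per-clause scores.

def boolean_or_score(scores):
    """A plain OR of optional clauses adds up the scores of matching clauses."""
    return sum(scores)


def dismax_score(scores, tie_breaker=0.0):
    """Best clause score, plus tie_breaker times the sum of the other scores."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


clause_scores = [2.0, 1.0]   # made-up scores for field1:value1 and field2:value2
print(boolean_or_score(clause_scores))
print(dismax_score(clause_scores))
print(dismax_score(clause_scores, tie_breaker=0.1))
```

With a tie breaker of 0, a document matching both clauses scores the same as one matching only the better clause. That is the closest scoring analogue to the short-circuit behavior asked about, although both clauses are still evaluated.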

Query Term elimination

In the boolean retrieval model, a query consists of terms combined using different operators. Conjunction is the most obvious choice at first glance, but as query length grows, bad things happen: recall drops significantly when using conjunction, and precision drops when using disjunction (for example, stanford OR university).
For now we use conjunction in our search system (with the boolean retrieval model), and we have a problem if the user enters a very rare word or a long sequence of words. For example, if the user enters toyota corolla 4wd automatic 1995, we probably don't have a matching document. But if we delete at least one word from the query, we do have such documents. As far as I understand, in the Vector Space Model this problem is solved automatically: we do not filter documents on term presence, we rank documents by the presence of terms.
So I'm interested in more advanced ways of combining terms in the boolean retrieval model, and in methods of rare term elimination in the boolean retrieval model.
It seems like the sky's the limit in terms of defining a ranking function here. You could define a weight vector whose components wi are, for example: 0 if the ith search term doesn't appear in the file and 1 if it does; the number of times search term i appears in the file; etc. Then rank pages based on e.g. Manhattan distance, Euclidean distance, etc., and sort in descending order, possibly culling results with a distance below a specified match tolerance.
If you want to handle more complex queries, you can put the query into CNF - e.g. (term1 or term2 or ... termn) AND (item1 or item2 or ... itemk) AND ... and then redefine the weights wi accordingly. You could list with each result the terms that failed to match in the file... so that the users would at least know how good a match it is.
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A1 and A2 and ... Am should fail if any of the terms is missing...
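As a minimal sketch of the ranking idea above, the following Python uses 0/1 weights and ranks by the number of matched terms. The `rank` function, the `min_matched` threshold and the sample documents are all made up for illustration, and tokenization is a naive whitespace split.

```python
# Sketch: rank by how many query terms a document contains, and report
# which terms were missing. Sample data and names are illustrative only.

def rank(query_terms, documents, min_matched=1):
    results = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        weights = [1 if term in tokens else 0 for term in query_terms]
        score = sum(weights)   # Manhattan (L1) norm of the 0/1 weight vector
        missing = [t for t, w in zip(query_terms, weights) if w == 0]
        if score >= min_matched:           # cull matches below the tolerance
            results.append((score, doc_id, missing))
    results.sort(key=lambda r: (-r[0], r[1]))   # best match first
    return results


documents = {
    "d1": "Toyota Corolla automatic 1995",
    "d2": "Toyota Corolla 4wd manual 1996",
    "d3": "Honda Civic automatic 1995",
}
query = ["toyota", "corolla", "4wd", "automatic", "1995"]

for score, doc_id, missing in rank(query, documents, min_matched=3):
    print(doc_id, score, "missing:", missing)
```

A strict conjunction would return nothing for this query. Here the near matches are still returned, ordered by how many terms they satisfy, and each result lists the terms it failed to match.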

Resources