Hi,
I am using the GAE Search API, and it seems to be a really great feature which adds vital functionality that standard datastore queries lack. But I have run into a problem implementing standard pagination, namely getting the total number of documents matching a query. Certainly, I can implement a list with a "show more" button using a Cursor, but it would also be great to be able to obtain a total count.
Any ideas on how to do this?
Thank you very much in advance!
Step 1: set your accuracy
QueryOptions options = QueryOptions.newBuilder()
    // ...set other options
    .setNumberFoundAccuracy(1000)
    .build();
Sets the accuracy requirement for Results.getNumberFound(). If set,
getNumberFound() will be accurate up to at least that number. For
example, when set to 100, any getNumberFound() <= 100 is accurate.
This option may add considerable latency / expense, especially when
used with setFieldsToReturn(String...).
Step 2: run the query
Query query = Query.newBuilder().setOptions(options).build(queryString);
Results<ScoredDocument> results = getIndex().search(query);
Step 3: call getNumberFound()
results.getNumberFound();
The number of results found by the search. If the value is less than
or equal to the corresponding QueryOptions.getNumberFoundAccuracy(),
then it is accurate; otherwise it is an approximation. Returns: the
number of results found.
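Once getNumberFound() is accurate up to your accuracy limit, classic pagination becomes simple arithmetic on the total. A minimal sketch (plain Python rather than the Java SDK, since the arithmetic is language-independent; the function names are illustrative):

```python
def page_count(number_found, page_size):
    """Total number of pages for a result set of number_found documents."""
    return (number_found + page_size - 1) // page_size  # ceiling division

def page_offset(page, page_size):
    """Offset to pass to the query options for 1-indexed page numbers."""
    return (page - 1) * page_size

# e.g. 1,042 matches shown 20 per page -> 53 pages; page 3 starts at offset 40
assert page_count(1042, 20) == 53
assert page_offset(3, 20) == 40
```

Note that if getNumberFound() exceeds the configured accuracy, the page count computed this way is only an approximation.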
In the documentation we have:
page_size: At most this many results will be returned.
It looks like when using a filter along with fetch_page, it doesn't return a minimum number of results, even though there are more results that actually match the query. Is that really the case?
Is it possible for fetch_page to return zero results, even though, if we keep going by continuing from the returned cursor, we'll find more results eventually?
And if that's the case, and I need a minimum number of results, does it mean that I have to "manually" accumulate results until I get to the desired number of entries? Or is there a feature in NDB that will "automatically" accumulate results until I have a certain minimum number?
Here's the code in question I'm using:
results, cursor, more = (cls.query(keys_only=True)
.filter(cls.user_id == user_id)
.filter(cls.expired == False)
.order(ordering)
.fetch_page(batch_size, start_cursor=start_cursor))
In my test environment most of the results saved in the database don't match the filters, but there are still quite a few that do, yet they don't appear in the results.
Starting with your query, it either uses a specific composite index or follows a merge join algorithm. The performance of merge join queries is described at https://cloud.google.com/datastore/docs/concepts/optimize-indexes#index_merge_performance .
As noted in that doc, it's possible for the query to match fewer than batch_size results in the RPC deadline, and thus return without batch_size results.
If the RPC successfully returns and has zero results, more should be false. If the RPC can't find any results, but isn't done scanning, it may return an error.
If you really need batch_size results, you should verify that you have batch_size results by issuing your query multiple times, updating start_cursor on every call. You should also use as few indexes as possible for serving your query.
The full document at https://cloud.google.com/datastore/docs/concepts/optimize-indexes should be helpful for you.
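The "issue your query multiple times, updating start_cursor" advice can be wrapped in a small accumulation loop. A hedged sketch (plain Python with a stand-in paging function in place of the real NDB fetch_page, so all names here are illustrative):

```python
def fetch_at_least(fetch_page, batch_size, min_results):
    """Repeatedly call fetch_page(batch_size, cursor) until at least
    min_results results have accumulated or the backend reports no more."""
    results, cursor, more = [], None, True
    while more and len(results) < min_results:
        batch, cursor, more = fetch_page(batch_size, cursor)
        results.extend(batch)
    return results, cursor, more

# Stand-in for query(...).fetch_page(): serves 5 total matches, batch_size
# at a time, using an integer offset as a fake cursor.
def fake_fetch_page(batch_size, cursor):
    start = cursor or 0
    data = list(range(5))
    batch = data[start:start + batch_size]
    nxt = start + len(batch)
    return batch, nxt, nxt < len(data)

results, cursor, more = fetch_at_least(fake_fetch_page, 2, 5)
assert results == [0, 1, 2, 3, 4]
assert more is False
```

With the real API you would pass a callable that invokes your query's fetch_page with start_cursor set from the previous call's returned cursor.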
I'm looking for a way to limit the effect (or eliminate it) of "keyword stuffing" in SOLR. (We're currently running a SOLR 6.2.0 server).
I've tried setting omitTermFreqAndPositions="true", but when I do that, some queries throw phrase query errors (specifically queries with search terms such as G1966B - likely due to word splitting and such). I could go down the road of disabling the word splitting and try to avoid the phrase query errors, but this is simply going to mess up more things than I'm trying to fix.
Does anyone have any suggestions on how to limit the effect of multiple keyword matches in a single field?
Example: If we have a description field with something like this:
BrandX 1200 Series G1924B LC/MSD SL XBC System.
This BrandX 1200 Series G1924B (G 1924 B, G1924 B, G 1924B) LC/MSD SL XBC System is in excellent condition.
When someone does a search for "G1924B" I would like to avoid scoring this document higher just because it happens to have G1924B (or a variation of that) in there several times.
In theory someone could repeat the keyword many times in their description to try to trick the system into ranking their search results higher.
Any suggestions?
Thanks!
This turns out to be a more frequent requirement than initially thought.
If you remove both term frequencies and positions, you lose phrase search capability.
I would recommend writing a custom similarity that ignores TF (term frequency).
At the moment the default BM25 takes TF into consideration.
You can just extend that class and adjust the similarity calculation to treat TF as a constant.
e.g.
org.apache.lucene.search.similarities.BM25Similarity.BM25DocScorer#score
[1] org.apache.lucene.search.similarities.BM25Similarity
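Once such a similarity class is compiled and on Solr's classpath, it can be wired in per field type in schema.xml. A sketch, where com.example.ConstantTFSimilarity is a hypothetical stand-in for your TF-ignoring subclass of BM25Similarity:

```xml
<fieldType name="text_no_tf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- hypothetical custom similarity that treats TF as a constant -->
  <similarity class="com.example.ConstantTFSimilarity"/>
</fieldType>
```

Because term frequencies and positions are still indexed, phrase queries keep working; only scoring changes.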
Is there a way to restrict the number of search results returned from SOLR? I am working for a client who would like to restrict the search results to 100 (based on search score). I can use rows, but that would only restrict the results per page, not the total results. The problem with that is that if SOLR's sort function is used, it sorts all the results, and a product ranked 105th by score might come out on top because of a low price. I want the sort to happen only on the top 100 results. Is there a way to do that?
Thanks for your help!
Supreet
You can use the Sort By Function.
You will have to query the normal way with rows=100 and also add the &sort=<query>.
I could not try it as I do not have a Solr instance right now. Please let me know if it works or not.
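Another way to get "sort only the top 100 by score" with stock Solr is to fetch the top 100 in one request (e.g. rows=100&fl=id,score,price, where price is an assumed field name) and re-sort that fixed set on the client. A minimal sketch of the client-side step in Python:

```python
def sort_top_n_by_price(docs, n=100):
    """Keep only the n best-scoring documents, then order those by price.

    docs: list of dicts with 'score' and 'price' keys, as might come back
    from a request like  q=...&rows=100&fl=id,score,price&sort=score desc
    (the 'price' field name is an assumption for illustration).
    """
    top = sorted(docs, key=lambda d: d["score"], reverse=True)[:n]
    return sorted(top, key=lambda d: d["price"])

docs = [
    {"id": "a", "score": 9.0, "price": 50},
    {"id": "b", "score": 8.0, "price": 10},
    {"id": "c", "score": 1.0, "price": 1},   # cheap but low-ranked
]
# With n=2 the cheap low-scoring doc 'c' is excluded before the price sort.
assert [d["id"] for d in sort_top_n_by_price(docs, n=2)] == ["b", "a"]
```

This guarantees the 105th-ranked product can never jump ahead, because it is cut before the price ordering is applied.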
I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure if I understand properly what you want to achieve but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you would want both parts to be boosted by the same scale (I am not sure if this isn't the default behaviour of Solr) you would want to use dismax or edismax and your query would change to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
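Spelled out as a full (unescaped) request, under the assumptions that edismax is enabled via defType and that id is the uniqueKey field:

```
q=(somequery)^10 OR id:(1 OR 2 OR 4)^10&defType=edismax&fl=id,score
```

Adding fl=id,score makes the returned scores visible so you can check that the pinned documents and the organic matches are on the same scale.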
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to make use of a pseudo-field calculated using the query() function to report how any set of results scores against some other query or queries. So if for example the original document set was the top-N from query_A, and those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting field …&fl=bscore:query({!dismax v="query_B"})&…. Then each document's score against query_B would be included in the output (as bscore).
Finally, the result-grouping functionality can be used to collect, in one go, both the top-N for one query and scores for lesser documents intersecting with other queries. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)
I am trying to do distance range search using Solr.
I know it's very easy to filter within a 5 km range:
&q=*:*&fq={!geofilt pt=45.15,-93.85 sfield=store d=5}
What I am after is how to do the same thing if I am looking in a range of, say, 5 to 10 km.
Thanks
Here are a couple of ways to approach this. Clearly the query goes into a filter query ("fq" param), since the intention is not to modify the score. And let's assume these parameters are set in the request URL (although they don't have to be placed there):
pt=45.15,-93.85&sfield=store
Here is one approach:
_query_:"{!geofilt d=10}" -_query_:"{!geofilt d=5}"
I used the _query_ Solr syntax hack to enter a sub-query which offers the opportunity to switch the query parser from the Lucene one to a geo one.
Here's another approach that is probably the fastest:
{!frange l=5 u=10}geodist()
This one is a function query returning the distance, which is then limited to the desired range. It is probably faster since it will evaluate each document once instead of twice, as the previous approach will.
You may want to mark this as not cacheable and add a bbox filter so that far fewer than all documents are examined. Here is the final result (not URL escaped):
pt=45.15,-93.85&sfield=store&fq={!frange l=5 u=10 cache=false cost=100}geodist()&fq={!bbox d=10}
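Independent of Solr, the ring filter's predicate is just "distance in [l, u]". A sketch in Python using the haversine formula, which is essentially what geodist() computes (all names here are illustrative):

```python
import math

def geodist_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (haversine), like Solr's geodist()."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def in_ring(pt, store, lower=5.0, upper=10.0):
    """True when store lies between lower and upper km from pt,
    mirroring {!frange l=5 u=10}geodist()."""
    d = geodist_km(pt[0], pt[1], store[0], store[1])
    return lower <= d <= upper

pt = (45.15, -93.85)
assert not in_ring(pt, (45.15, -93.85))   # distance 0: inside the hole
assert in_ring(pt, (45.15, -93.76))       # roughly 7 km east: in the ring
```

The bbox filter in the final request plays the role of a cheap pre-check: anything outside a 10 km bounding box can be skipped before the exact distance is computed.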