Google App Engine - keyword search + ordering on other properties - google-app-engine

Say I have an entity that looks a bit like this:
class MyEntity(db.Model):
    keywords = db.StringListProperty()
    sortProp = db.FloatProperty()
I have a filter that does a keyword search by doing this:
query = MyEntity.all()\
    .filter('keywords >=', unicode(kWord))\
    .filter('keywords <', unicode(kWord) + u"\ufffd")\
    .order('keywords')
Which works great. The issue I'm running into is that if I try to put an order on that using 'sortProp':
.order('sortProp')
ordering has no effect. I realize why - the documentation specifically says this is not possible and that sort order is ignored when using equality filters with a multi-valued property (from the Google docs):
One important caveat is queries with both an equality filter and a
sort order on a multi-valued property. In those queries, the sort
order is disregarded. For single-valued properties, this is a simple
optimization. Every result would have the same value for the property,
so the results do not need to be sorted further. However, multi-valued
properties may have additional values. Since the sort order is
disregarded, the query results may be returned in a different order
than if the sort order were applied. (Restoring the dropped sort order
would be expensive and require extra indices, and this use case is
rare, so the query planner leaves it off.)
My question is: does anyone know of a good workaround for this? Is there a better way to do a keyword search that circumvents this limitation? I'd really like to combine using keywords with ordering for other properties. The only solution I can think of is sorting the list after the query, but if I do that I lose the ability to offset into the query and I may not even get the results with the highest sort order if the data set is large.
Thanks for your tips!

Workaround 1:
Apply a stemming algorithm to the keywords so that you no longer need a comparison lookup.
Workaround 2:
Store all unique keywords in a separate entity kind ("table"). Query that kind for the keywords that match your criteria, then run a query with keywords IN [kw1, kw2, ...]. Keep the number of matching keywords small; for example, select only the first 10.
Workaround 3:
Reorder the list of items on the application side.
Workaround 4:
Use IndexTank for full-text search, or apply for the "Trusted Tester Program" as mentioned by @proppy.
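Workaround 3 could be sketched roughly as follows. This is a minimal illustration, not App Engine code: plain dicts stand in for fetched MyEntity instances, and top_by_sort_prop is a hypothetical helper name.

```python
import heapq

def top_by_sort_prop(entities, limit, offset=0):
    """Return entities ranked by sortProp (highest first), paged in memory."""
    # nlargest keeps only offset + limit items in its heap, so it avoids
    # fully sorting a large fetched result set.
    ranked = heapq.nlargest(offset + limit, entities, key=lambda e: e["sortProp"])
    return ranked[offset:offset + limit]

# Stand-ins for entities returned by the keyword query (hypothetical data).
fetched = [
    {"name": "a", "sortProp": 0.2},
    {"name": "b", "sortProp": 0.9},
    {"name": "c", "sortProp": 0.5},
    {"name": "d", "sortProp": 0.7},
]
page_one = top_by_sort_prop(fetched, limit=2)
page_two = top_by_sort_prop(fetched, limit=2, offset=2)
```

This only ranks what was actually fetched. As the question points out, the ranking is correct only if every matching entity is pulled into memory first.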

Instead of doing prefix matches, properly tokenize, stem and normalize your strings, and do equality comparisons on them.
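A rough sketch of what such normalization might look like, assuming English text. The suffix stripper here is a deliberately naive stand-in for a real stemmer (such as Porter), and the function names are made up for illustration.

```python
import re

# Toy suffix list; a real stemmer handles far more cases.
_SUFFIXES = ("ing", "es", "s")

def naive_stem(token):
    """Strip one common English suffix; illustrative only."""
    for suffix in _SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def normalize_keywords(text):
    """Lowercase, tokenize on non-alphanumerics, stem, and de-duplicate."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({naive_stem(t) for t in tokens})

# The same function is applied when storing keywords and when querying.
stored = normalize_keywords("Searching searches, Search!")
query_term = normalize_keywords("searches")[0]
```

The normalized list would be stored in keywords at write time, and the normalized query term used in an equality filter such as .filter('keywords =', query_term). With no inequality filter on keywords, the datastore rule that the inequality-filtered property must be sorted first should no longer apply, so an order on sortProp becomes possible (it will likely need a composite index).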

Related

Getting the same record on multiple pages when implementing pagination in Vespa

I am getting the same record on different pages when implementing pagination using grouping.
I am using the query mentioned below:
http://<hostname>:<port>/search/?yql=select * from sources document_name where sddocname contains 'document_name' | all(group(key) max(2) each(each(output(summary()))));
Are you looking at the grouping results or the normal hits structure? Please note that the grouping expression will not in any way affect the normal hits returned.
You will probably want to add LIMIT 0 / hits=0 and only look at the results from the grouping expression.
You also need a (stable) ordering of the hits for pagination by continuations to work well. This is usually the case as in most use cases there will be a ranking expression in place.
The default ordering in grouping expressions is by rank - in grouping expression syntax this would be order(max(relevance())).
The query above only limits on document type. All documents of that document type will match this query equally well. I tested this using the "album-recommendation-selfhosted" sample app, and relevance was 0 for all documents. When the relevance is the same for all documents, the order will essentially be random. The same thing may occur when doing e.g. order(-count()) if count() is the same for several groups.
I was able to achieve the expected results by adding and using a ranking profile using the random.match rank feature: https://docs.vespa.ai/documentation/reference/rank-features.html#random
I believe this should ensure a stable ordering of hits, although it may still produce different results if the query is dispatched to different (groups of) content hosts. If you need a stable global ordering, consider storing a random float/double in each document to rank/order by. This value can also serve as a "tie breaker" to help ensure a stable order from ranking expressions.
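The tie-breaker idea can be illustrated outside Vespa with a small simulation. This is only a sketch: the documents, the tie_breaker field name, and the paginate helper are all hypothetical, and the shuffle stands in for hits arriving in a different order on each request.

```python
import random

def paginate(docs, page, page_size, key):
    """Sort with the given key and return one page of document ids."""
    ordered = sorted(docs, key=key)
    start = page * page_size
    return [d["id"] for d in ordered[start:start + page_size]]

# All documents tie on relevance, as in the query above (hypothetical data).
rng = random.Random(42)
docs = [{"id": i, "relevance": 0.0, "tie_breaker": rng.random()} for i in range(6)]

def stable_key(doc):
    # Highest relevance first, then the stored per-document random value.
    return (-doc["relevance"], doc["tie_breaker"])

pages = []
for page in range(3):
    # Shuffle to mimic hits arriving in a different order on every request.
    rng.shuffle(docs)
    pages.append(paginate(docs, page, 2, stable_key))

seen = [doc_id for page in pages for doc_id in page]
```

Because the stored tie-breaker fully determines the order among equal-relevance documents, each document lands on exactly one page no matter how the input was ordered.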

Best way to query documents with a list of tags

I have an index which has a field that is a string collection, which contains a list of tags.
Does anyone know the most efficient way to query the index, with a list of tags to match against the tags string collection?
This is a very inefficient example of what I am trying to do:
/indexes/instruments/docs?api-version=2014-07-31-Preview&$top=10&$skip=0&$count=true&search=*&$filter=universes/any(t: t eq 'U') or universes/any(t: t eq 'B') or universes/any(t: t eq 'E')
In this example the tags field is "universes". The problem is that I need to filter on as many as 30 tags, so this query seems terrible!
This is the right way to express this query. It does look long syntactically but it should run fine from the efficiency perspective. What will dominate response time is not so much the number of terms here (at least in the order of magnitude you mentioned) but how big the matching set is.
10/16/2017 update: note that Azure Search now has a new filter function, search.in(), that provides a more compact representation and faster execution for queries like this. More details and API version requirements here: https://learn.microsoft.com/en-us/rest/api/searchservice/odata-expression-syntax-for-azure-search
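As a rough illustration of the difference, the two filter styles could be generated like this. The helper names are hypothetical, and the sketch assumes tags contain neither the delimiter nor single quotes; check the linked documentation for the exact search.in semantics in your API version.

```python
def any_eq_filter(field, tags):
    """Build the long form: one any() clause per tag, joined with 'or'."""
    return " or ".join("{0}/any(t: t eq '{1}')".format(field, tag) for tag in tags)

def search_in_filter(field, tags, delimiter=","):
    """Build the compact form using search.in inside a single any()."""
    for tag in tags:
        if delimiter in tag or "'" in tag:
            raise ValueError("tag needs escaping or another delimiter: %r" % tag)
    joined = delimiter.join(tags)
    return "{0}/any(t: search.in(t, '{1}', '{2}'))".format(field, joined, delimiter)

long_form = any_eq_filter("universes", ["U", "B", "E"])
compact_form = search_in_filter("universes", ["U", "B", "E"])
```

The resulting string would be passed as the $filter parameter in place of the chained or-clauses.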
Because you want to implement an explicit filter your query is indeed probably the best you can do. This style of filter will handle the tags being in any order which makes it better than other solutions that include 'fixing' the index by injecting fields that are a concatenated result of multiple tags.
For 30+ tags you might also get good results using a tag 'boost' scoring profile: pass in the tags that you want as a parameter and give matching results an unreasonably high boost. If you then need to filter strictly by tags, you would have to filter on the client to remove results that did not receive a score from your boost profile.
http://azure.microsoft.com/blog/2015/02/05/personalizing-search-results-announcing-tag-boosting-in-azure-search/

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure if I understand properly what you want to achieve but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you would want both parts to be boosted by the same scale (I am not sure if this isn't the default behaviour of Solr) you would want to use dismax or edismax and your query would change to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has an effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to make use of pseudo-field calculated using the query() function to report how any set of results score on some other query/queries. So if for example the original document set was the top-N from query_A, and then those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting-field …&fl=bscore:query({!dismax v="query_B"})&…. Then the document's scores against query_B would be included in the output (as bscore).
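A minimal sketch of assembling such a request, using only the fl syntax quoted above. The function name, the id field, and the rows value are assumptions, and the sketch simply refuses queries containing double quotes rather than attempting to escape them.

```python
from urllib.parse import urlencode, parse_qs

def score_against_other_query(main_query, other_query, rows=10):
    """Build Solr select parameters that report each hit's score on a second query."""
    if '"' in other_query:
        raise ValueError("other_query must not contain double quotes in this sketch")
    params = {
        "q": main_query,
        "rows": rows,
        # bscore is a pseudo-field computed by the query() function.
        "fl": 'id,score,bscore:query({!dismax v="%s"})' % other_query,
    }
    # urlencode percent-escapes the braces, quotes, and spaces for the URL.
    return urlencode(params)

query_string = score_against_other_query("query_A", "query_B")
decoded = parse_qs(query_string)
```

The encoded string would be appended to the select handler URL of whatever Solr core is in use.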
Finally, the result-grouping functionality can be used to collect both the top-N for one query and the scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)

Solr sort: specify a particular document to appear first

I want to pin a particular document to the first position in Solr's sorted results.
For example, given the results:
5,2,3,1
I want 2 first, with the others sorted according to the normal rules:
2,1,3,5
How can I do this?
I know of two ways you can try to tackle this using Solr.
The first is to use the QueryElevationComponent. This lets you define the top results at index time. As suggested in the documentation, this is good for placing sponsored results or popular documents at the top of the search results. The potential downside is that you have to be able to identify those documents at index time and not at query time.
The other approach is to boost the desired documents at query time using the bq parameter. To boost document 435, you would do something like this:
...&bq=id:435^10
Unfortunately, neither of these approaches gives you absolute control over the order of the results.
The solution provided by Riking would certainly do the job if you don't mind processing the results after performing the search. Another approach you could consider is to add a field to your Solr schema that defines a display order or priority. You can then sort on that field to get the desired sort order.
If you are using Solr 3.1 or later, you can sort by a function query. The map function is useful for this.
sort=map(field_name,5,5,0) asc
In the above, field_name is the name of the field you want to sort by, 5 is the value you want to push to the front and 0 must be replaced with some number that you know is less than all other numbers.
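To see why this works, the effect of map can be mimicked locally on the example from the question. This is a sketch that assumes a numeric field; solr_map only imitates the four-argument form map(x,min,max,target), and both helper names are hypothetical.

```python
def solr_map(value, low, high, target):
    """Mimic Solr's map(x,min,max,target): values in [low, high] become target."""
    return target if low <= value <= high else value

def pin_first_sort_param(field, pinned, floor):
    """Build the sort parameter; floor must be below every real field value."""
    return "map({0},{1},{1},{2}) asc".format(field, pinned, floor)

# Results 5,2,3,1 from the question, with 2 pinned to the front.
ids = [5, 2, 3, 1]
ordered = sorted(ids, key=lambda v: solr_map(v, 2, 2, 0))
sort_param = pin_first_sort_param("id", 2, 0)
```

The pinned value is mapped to the floor, so it sorts ahead of everything else while the remaining values keep their natural ascending order.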
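Call the built-in sort() function, then shift the desired element to the front.
Pseudocode, in case you do not have a built-in method to shift it to the front:
tmp = desired;
int dIndex = array.indexOf(desired);
// Move every element before the desired one a single slot to the right.
for (int i = dIndex - 1; i >= 0; i--)
{
    array[i + 1] = array[i];
}
// Put the desired element in the freed first slot.
array[0] = tmp;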
If you use the standard query parser (not dismax), add "OR id:2^1000" to your query, like this:
q=(text:lalala AND author:Bob) OR id:2^1000
That will place the document with ID=2 at the top of the results.

Order solr documents with same score by date added descending

I want to have search results from SOLR ordered like this:
All the documents that have the same score will be ordered descending by date added.
So when I query solr I will have n documents. In this results set there will be groups of documents with the same score. I want each of this group of documents to be ordered descending by date added.
I discovered I can accomplish this using function queries, more exactly using rord function http://wiki.apache.org/solr/FunctionQuery#rord, but as it is stated in the documentation
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use
since they must use a FieldCache entry at the top level reader, while
sorting and function queries now use entries at the segment level.
Hence sorting or using a different function query, in addition to
ord()/rord() will double memory use.
it will cause excess memory use.
What other options do I have ?
I was thinking to use recip(ms(NOW,startTime),1,1,0). Is this the best approach ?
Is there any negative performance impact if I use recip and ms ?
You can use multiple SORT conditions:
Multiple sort orderings can be separated by a comma, ie: sort=<field name>+<direction>[,<field name>+<direction>]...
http://wiki.apache.org/solr/CommonQueryParameters
So, in your case would be:
sort=score DESC, date_added DESC
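The effect of that two-level sort can be previewed on a few made-up documents. This is only a local simulation of the ordering, with hypothetical ids, scores, and dates; Solr does the actual sorting server-side.

```python
from datetime import datetime

docs = [
    {"id": "a", "score": 1.5, "date_added": datetime(2012, 1, 10)},
    {"id": "b", "score": 2.0, "date_added": datetime(2012, 3, 1)},
    {"id": "c", "score": 1.5, "date_added": datetime(2012, 6, 5)},
    {"id": "d", "score": 2.0, "date_added": datetime(2012, 2, 1)},
]

# Equivalent of "score DESC, date_added DESC": sorting on the tuple with
# reverse=True orders both keys descending.
ordered = sorted(docs, key=lambda d: (d["score"], d["date_added"]), reverse=True)
ordered_ids = [d["id"] for d in ordered]
```

Documents with equal scores end up newest first, and the date is never consulted when scores differ.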
Since your question says:
All the documents that have the same score will be ordered descending
by date added.
the other answer you got is perfect.
Anyway, I'd suggest making sure that you really want to sort by date only for documents with the same score. In my experience this has always been wrong. The Solr score is not absolute but only relative to other documents, and each document is different.
Therefore I wouldn't sort by score and then something else, because it's hard to predict when you'll have the same score for different documents.
I would personally sort only on score and use a function to boost recent documents. You can find a good example on the solr wiki, the function used there is recip(ms(NOW,date_field),3.16e-11,1,1).
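For a feel of what that function yields, here is the arithmetic worked in Python. recip(x,m,a,b) evaluates a / (m*x + b); the constant 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, so the boost halves at about one year of age. The helper names are my own.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000  # about 3.15e10 milliseconds

def recip(x, m, a, b):
    """Solr's recip(x,m,a,b) function: a / (m*x + b)."""
    return a / (m * x + b)

def recency_boost(age_ms):
    """Boost for a document of the given age, using the constants quoted above."""
    return recip(age_ms, 3.16e-11, 1, 1)

boost_new = recency_boost(0)                      # brand-new document
boost_one_year = recency_boost(MS_PER_YEAR)       # close to one half
boost_ten_years = recency_boost(10 * MS_PER_YEAR) # close to one eleventh
```

The boost decays smoothly from 1 toward 0 as documents age, so recent documents are favored without imposing a hard cutoff.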
If you're worried about performance, you can try index-time boosting, which should be faster than query-time boosting. Have a look here.
