Why is it not suggested to implement typeahead using Wildcard search? - solr

Most tutorials suggest implementing autosuggest either with the Suggester component or with primitive typeahead techniques:
https://blog.griddynamics.com/implementing-autocomplete-with-solr/
However, my question is why no one suggests using a simple wildcard search for this, e.g. for giving name suggestions when the user types mob:
q=name:(*mob*)
Is this approach feasible for implementing autosuggest, compared to the other approaches? What would the repercussions be?

The strategy can work - for simple queries. The problem is that when you're querying with wildcards, the analysis chain is not invoked (a bit of a simplification - most filters are skipped, and only those that are MultiTermAware are applied) - so as soon as you type a space, you're out of luck. You can work around this with a ComplexPhraseQuery, but that might not be what you're looking for (and it can quickly get expensive in the number of terms).
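For reference, a multi-term prefix query through the ComplexPhrase parser would look something like this (using the name field from the question):

q={!complexphrase inOrder=true}name:"mob* pho*"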
In your example with a leading wildcard, the query will also be very expensive, since it requires Lucene (Solr's underlying search library) to in effect look at every generated token and check whether the text mob occurs somewhere inside it. And since no analysis takes place: if you had indexed men's (which in most cases would be processed down to the single token men) and searched for men's*, you wouldn't get a hit.
So it works - kind of - but it's not ideal, and that's the reason the Suggester was implemented. The Suggester component supports many different configuration options to get the behavior you want, as well as (for some backends) context filtering (which, to be fair, would be easier to implement with just a wildcard, since it'd be a regular fq). The Suggester also supports weights, while wildcards can't really do that in a proper way.
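For comparison, a minimal Suggester setup in solrconfig.xml could look roughly like this - the component, handler and field names here are illustrative, not taken from the question:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">nameSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">nameSuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

The AnalyzingInfixLookupFactory backend in particular supports the weights (via weightField) and context filtering mentioned above.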

Related

Any examples of using a WandSearcher in Vespa? (after a weighted set query)

Currently I am using the REST interface to query Vespa, which seems to work great, but something tells me I should be using searchers in the application to make the client (server-side code) a bit lighter (bundle the jar file in the application package) and the whole flow a bit smoother. I have managed to write some simple searcher/processor applications, but this is a bit overwhelming.
So, are there any readily available examples?
Basically, I want to:
Send to /search?query=someId
Do an ordinary search for the weighted set on this document ID (I guess this one can be handy: https://docs.vespa.ai/documentation/reference/inspecting-structured-data.html)
Take the items in that response, add them to a WandItem, and run a wand query with the WandSearcher on a given field, then return the matches. Similar to the YQL:
"select * from sources * where wand(interest, some weighted set);" with "ranking": "combined_score"
Also, just curious: apart from the hassle of the string building I am doing for the HTTP request at the moment, are there any performance gains in using a searcher (going the Java route) vs REST?
Thanks for any insight or code I can start from.
There is an example of using the WandItem (YQL wand) here: https://docs.vespa.ai/documentation/advanced-ranking.html. See also https://docs.vespa.ai/documentation/using-wand-with-vespa.html, as there are two wand implementations available in Vespa; from your description it sounds like wand() is the one you want for this use case. For the first call you probably want a dedicated document summary to reduce the amount of data fetched, which also gives you the option of serving it out of memory only (see https://docs.vespa.ai/documentation/document-summaries.html).
Also see https://docs.vespa.ai/documentation/searcher-development.html as a general resource on writing searchers.
For your use case it makes a lot of sense to write a searcher to perform these two queries, since the second query depends on the first, and you avoid the cost of rendering/HTTP/YQL parsing in between, which matters if your client is remote with high network latency.
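A rough sketch of such a two-phase searcher - the interest field and combined_score ranking profile are taken from your description, while the readWeightedSet helper, the id field and the exact shape of the stored weighted set are assumptions:

import java.util.Collections;
import java.util.Map;

import com.yahoo.prelude.query.WandItem;
import com.yahoo.prelude.query.WordItem;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.result.Hit;
import com.yahoo.search.searchchain.Execution;

public class InterestWandSearcher extends Searcher {

    @Override
    public Result search(Query query, Execution execution) {
        // Phase 1: ordinary lookup of the source document by the incoming id.
        Query lookup = query.clone();
        String documentId = query.getModel().getQueryString(); // from ?query=someId
        lookup.getModel().getQueryTree().setRoot(new WordItem(documentId, "id"));
        Result lookupResult = execution.search(lookup);
        execution.fill(lookupResult); // make sure summary fields are filled in
        if (lookupResult.hits().size() == 0) {
            return lookupResult; // nothing to expand from
        }

        // Phase 2: build a WandItem from the weighted set of the first hit.
        WandItem wand = new WandItem("interest", 100); // 100 = target number of hits
        for (Map.Entry<String, Integer> token : readWeightedSet(lookupResult.hits().get(0)).entrySet()) {
            wand.addToken(token.getKey(), token.getValue());
        }
        Query wandQuery = query.clone();
        wandQuery.getModel().getQueryTree().setRoot(wand);
        wandQuery.getRanking().setProfile("combined_score");
        return execution.search(wandQuery);
    }

    // Hypothetical helper: how you extract the weighted set depends on the
    // field's summary representation - see the inspecting-structured-data doc.
    @SuppressWarnings("unchecked")
    private Map<String, Integer> readWeightedSet(Hit hit) {
        Object field = hit.getField("interest");
        return field instanceof Map ? (Map<String, Integer>) field : Collections.emptyMap();
    }
}

The searcher is deployed in the application package and wired into a search chain, which also answers the performance question: both queries run inside the container, so you skip one HTTP round trip plus the YQL parsing and result rendering in between.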

Custom Searcher - Blending of hits from different sources

We have a need for "blending of hits from different sources", and per your documentation it is recommended to write a custom searcher in Java. Is there a demo of this written somewhere on GitHub? I wouldn't even know where to start :( I understand I can create search "chains", preferably asynchronous, and then blend the results in Java before returning them... but then how would I handle pagination, limits, etc.? This all seems very complicated for someone who doesn't even know Java that well. So I am hoping someone has already written a demo for this? Please? Anyone?
Thank you so much
EDIT to make my question clearer:
We are writing a search engine that fetches data from various websites. Some websites have 10 million indexable items, other websites only 100,000. When we present the results to the end user, we want to include results from all our sources (when the match applies) - say, 10 results from each of the websites we crawl, so that they all get an equal amount of attention on the page. If we don't do custom blending, what happens is that the largest website, with the most items, wins all our traffic.
I understand that we can send 10 separate queries to Vespa and blend the results in our front end, but that seems very inefficient. Thus the question about a custom searcher. Thank you so much!
That documentation covers some very advanced use cases, which you do not have. Are your sources different Vespa schemas or content clusters? If so, Vespa will by default blend the hits returned from each according to their relevance scores, so there's nothing you need to do.
The two other most common use cases are:
Some (or all) of the data sources are external, so you need to write a Searcher component to fetch the external data and turn it into a Result.
You want the data to be blended in some custom way (rather than by relevance score). If so, you need to exclude the default blending Searcher (com.yahoo.prelude.searcher.BlendingSearcher) and write your own.
If you provide some more information about your use cases I can give you some code examples.
EDIT: Use grouping to solve the need explained under "EDIT" in the question:
Create a "siteid" field when feeding (e.g. in document processing).
Use the grouping expression all(group(siteid) each(max(10) each(output(summary()))))
See http://docs.vespa.ai/documentation/grouping.html
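Putting it together, the full request could look roughly like this (shown unencoded for readability; the siteid field is from the step above):

/search/?query=user+terms&yql=select * from sources * where userQuery() | all(group(siteid) each(max(10) each(output(summary()))));

Each group then carries up to 10 hit summaries for one site, so every source gets equal exposure regardless of its index size.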

Preventing certain docs from being indexed in CLucene

I am building a search index with CLucene and I want to make sure docs containing any offensive terms never get added to the index. Using a StandardAnalyzer with a stop list is not good enough, since the offensive doc still gets added and would be returned for non-offensive searches.
Instead I am hoping to build up a document, then check whether it contains any offensive words, and add it only if it doesn't.
Cheers!
You can't really access that type of data in a Document.
What you can do is run the analysis chain manually on the text and check each token individually. You can do this in a simple loop, or by adding another step to the analysis chain that just raises a flag you check later.
This introduces some more work, but it's the best way to achieve that, IMO.
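A minimal sketch of that manual check, written against the Java Lucene API (CLucene mirrors it closely, though the exact names differ by version):

import java.io.IOException;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class OffensiveTermChecker {

    // Run the same analyzer you index with over the raw text and inspect
    // every token it produces; skip writer.addDocument() if any is blocked.
    static boolean containsBlockedTerm(String text, Set<String> blocked) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                if (blocked.contains(term.toString())) {
                    return true;
                }
            }
            ts.end();
        }
        return false;
    }
}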

How to make another search call inside SOLR

I would like to implement some kind of fallback querying mechanism inside Solr. That is, if a first search call doesn't generate enough results, I would like to make another call with a different ranking and then combine the results and return them. I guess this could be done on the Solr client side, but I hope to do it inside Solr. From reading the documentation, I guess I need to implement a search component and then add it next to the "query" component? Any reference or experience in this regard would be highly appreciated.
SearchHandler calls all the registered search components in the order you define, and there are several stages (prepare, process, etc.).
You only know the number of results after the distributed processing phase (I assume you work in distributed mode), so your custom search component should check the number of results in the response object and run its own query if necessary.
Actually, you may inherit from (or wrap) the regular QueryComponent for that, augmenting its process/distributed-process phases.
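A rough sketch of the inheriting approach for the single-node case - the threshold and the way you relax the query are placeholders, and the exact ResponseBuilder accessors should be checked against your Solr version (in distributed mode you would override the distributed stages instead):

import java.io.IOException;

import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class FallbackQueryComponent extends QueryComponent {

    private static final int MIN_RESULTS = 10; // hypothetical threshold

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        super.process(rb); // run the normal query first
        if (rb.getResults() != null && rb.getResults().docList.matches() < MIN_RESULTS) {
            // Not enough hits: relax the query or swap the ranking here
            // (e.g. set a rewritten query on rb), then run the query again.
            super.process(rb);
        }
    }
}

The component is then registered in solrconfig.xml in place of the stock "query" component.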

How to perform Geo Spatial search with django-haystack + solr

I'm currently using django-haystack with Xapian. I couldn't find any documentation on how to perform geospatial queries in Xapian, but there seems to be some momentum behind Solr, so I'm currently experimenting with that.
I couldn't get SpatialSolr to work properly locally, but for now I'm working with solr-spatial-light, which seems to work fine. It accepts queries like
http://127.0.0.1:8080/solr/select/?q=blahblah&spatial={!radius=1.0%20sort=true}lat:10.0,lng:-10.0
Can anyone point me to a patch for Haystack that allows me to pass custom queries like that? I could use raw_search(), but then I can't chain the results. In any case, I would like to find a cleaner way to do something like
sqs.spatial(....)
There are some patches from other people mentioned on the Google group (links below), but most of them are unreachable.
References:
https://github.com/fizx/solr-spatial-light
http://groups.google.com/group/django-haystack/browse_thread/thread/d0e23d45c0baa300/2298b6cf43389e18?lnk=gst&q=Spatial#2298b6cf43389e18
http://groups.google.com/group/django-haystack/browse_thread/thread/f88d625679941d77/420892adac151a64
http://groups.google.com/group/django-haystack/browse_thread/thread/e3a70112ce944b00/33bd673fbaaed0a7?lnk=gst&q=jteam#33bd673fbaaed0a7
If you're not tied to Xapian, look at "Django, Sphinx and search by distance". I had a similar problem when I ran across this question, and that seems to solve it. Thanks to django-sphinx, it's about as easy to set up as Haystack. Sphinx also seems to offer more flexibility.
Here's a fork of django-haystack that adds support for spatial search:
https://github.com/sidmitra/django-haystack-spatialsolrplugin
And corresponding notes are here:
https://github.com/sidmitra/django-haystack-spatialsolrplugin/wiki/_pages
Sidmitra, I made a port of your solution to Haystack 1.2.X and Solr 3.4 - with some limitations, to be frank: no support for schema generation at the moment, only the LatLong geo type is supported, and sorting by distance is not perfect (but works).
https://github.com/frutik/django-haystack/tree/1.2.X
I agree with the recommendation of https://github.com/sidmitra/django-haystack-spatialsolrplugin.
It seems to be out of date, but I could beat it into shape with some work. Issues I had:
The Java SSP was hard to find, and when I found it, it was the wrong version. http://www.dutchworks.nl/en/home/download.html was the link that worked for me.
The classpaths in the example XML files I found on the net were all wrong; I had to remove .solrext. from all of them.
The plugin was very picky about which directory it lived in; it couldn't talk to anything else until it was happily in solr/lib.
solr_backend.py required the following patch (around line 505):
if self.spatial_query:
    # prepend the {!spatial ...} local params so the plugin picks them up
    final_query = '{{!spatial circles={lat},{long},{radius} }}{0}'.format(
        final_query, **self.spatial_query)
I had further issues getting solrconfig.xml right so that GeoDistanceComponent never loaded before the query had a valid rsp.
In other words, you can certainly make it work, but you have to be able to deal with a number of error messages in both python and java before you get there.
