How to make another search call inside SOLR - solr

I would like to implement some kind of fallback querying mechanism inside SOLR. That is if a first search call doesn't generate enough results, I would like to make another call with different ranking and then combine the results and return it. I guess this can be done in the SOLR client side but I hope to do this inside the SOLR. By reading documentation, I guess I need to implement a search component and then add it next to "query" component? Any reference or experience in this regard would be highly appreciated.

SearchHandler calls all the registered search components in the order you define, and there are several stages (prepare, process, etc.).
You know the number of results only after the distributed processing phase (I assume you are running in distributed mode), so your custom search component should check the number of results in the response object and run its own query if necessary.
You can also inherit from (or wrap) the regular QueryComponent for that, augmenting its process/distributed process phases.
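A minimal sketch of that check-then-fallback logic, written in plain Java rather than against the Solr API. The class name, method name and the use of suppliers are made up for illustration; in a real component this logic would sit in the process phase and work on the response object instead of lists of ids.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Library-free model of the fallback idea; nothing here is a Solr class.
class FallbackSearch {

    // Runs the primary query. Only if it returns fewer than minResults
    // document ids is the fallback query executed, and its ids are appended
    // after the primary ones with duplicates dropped.
    static List<String> search(Supplier<List<String>> primary,
                               Supplier<List<String>> fallback,
                               int minResults) {
        Set<String> merged = new LinkedHashSet<>(primary.get());
        if (merged.size() < minResults) {
            merged.addAll(fallback.get()); // primary hits keep their positions
        }
        return new ArrayList<>(merged);
    }
}
```

Passing the queries as suppliers means the second (differently ranked) query costs nothing when the first one already returned enough results.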

Related

solr integrate highlighting data within main search results

I am using the highlighting feature in Solr 8.3 with the default method. With the current behaviour, the main search results and the highlighted results are returned in different blocks. Is there a way to implement highlighting within the main search results, without returning an extra block for highlighting?
I believe highlighting is returned as a separate component for performance reasons (for example, the default limits for hl.maxAnalyzedChars, hl.snippets and hl.fragsize). But if someone has written a custom component to integrate both, please share the steps, along with how it performs.

Why is it not suggested to implement typeahead using Wildcard search?

Most tutorials suggest implementing autosuggest using either the Suggester component or primitive typeahead techniques:
https://blog.griddynamics.com/implementing-autocomplete-with-solr/
However, my question is why no one suggests using a simple wildcard search for this, for example to give name suggestions when the user types mob:
q=name:(*mob*)
Is it feasible to use this approach for autosuggest instead of the other approaches? What would the repercussions be?
The strategy can work - for simple queries. The problem is that when you're querying with wildcards, the analysis chain is not invoked (a bit of a simplification - most filters are not invoked, only those that are MultiTermAware) - so as soon as you type a space, you're out of luck. You can work around this with the ComplexPhraseQuery, but that might not be what you're looking for (and can get expensive in regards to the number of terms quickly).
In your example with a leading wildcard, the query will also be very expensive - since it will require Lucene (Solr's underlying search library) to in effect look at each generated token and see if somewhere inside that token there's the text mob. And since you don't have any analysis taking place - if you'd have indexed men's (which would be processed to match just men as a single token in most cases), and searched for men's* - you wouldn't get a hit.
So it works - kind of - but it's not ideal. That's the reason why the suggester was implemented. The suggester component supports many different configuration options to get the behavior you want, as well as (for some backends) context filtering (which would be easier to implement with just a wildcard, since it'd be a regular fq). The suggester also supports weights - while wildcards wouldn't really do that in a proper way.
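To make the cost argument concrete, here is a simplified model in plain Java. It is not Lucene code: a TreeSet stands in for the sorted term dictionary, and the class and method names are made up. A trailing wildcard can seek directly to a range of terms, a leading wildcard has to look at every term, and an unanalyzed query text like men's finds nothing when the index only holds the analyzed token men.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of wildcard matching against a sorted term dictionary.
class WildcardCost {

    // mob* : all terms with this prefix sort next to each other, so a range
    // view is enough (the upper bound ignores terms containing U+FFFF).
    static List<String> prefixMatch(TreeSet<String> terms, String prefix) {
        return new ArrayList<>(terms.subSet(prefix, prefix + Character.MAX_VALUE));
    }

    // *mob* : ordering does not help, every term has to be inspected.
    static List<String> infixMatch(TreeSet<String> terms, String fragment) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (term.contains(fragment)) {
                matches.add(term);
            }
        }
        return matches;
    }
}
```

With the terms automobile, men, mobile and mobster, prefixMatch(terms, "men's") returns an empty list even though men is indexed, which is the missing-analysis problem described above.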

Any examples of using a Wandsearcher in vespa ? (After a weighted set query)

Currently I am using the REST interface to query Vespa, which seems to work great, but something tells me that I should be using searchers in the application (bundling the jar file in the application package) to make the client (server-side code) a bit lighter and smoother. I have managed to write some simple searcher/processor applications, but this is a bit overwhelming.
So are there any readily available examples?
Basically I want to:
Send to /search?query=someId
Do an ordinary search for the weighted set on this documentID (I guess this one can be handy: https://docs.vespa.ai/documentation/reference/inspecting-structured-data.html)
Take those items in the response and add it to a wand item(s) and query for a wand with wandsearcher on a given field. Similar to the yql:
"select * from sources * where wand(interest, some weightedsets));","ranking":"combined_score" and return the matches.
Just curious also: apart from the trouble of building the string for the HTTP request I am doing at the moment, are there any performance gains from using a searcher (the Java route) versus REST?
Thanks for any insight or code I can start with.
There is an example of using the WandItem (YQL wand) here: https://docs.vespa.ai/documentation/advanced-ranking.html. See also https://docs.vespa.ai/documentation/using-wand-with-vespa.html, as there are two wand implementations available in Vespa; from your description, wand() is the one you want for this use case. For the first call you probably want a dedicated document summary to reduce the amount of data fetched by your first query, with the option of serving it from memory only (see https://docs.vespa.ai/documentation/document-summaries.html).
Also see https://docs.vespa.ai/documentation/searcher-development.html as a general resource on writing searchers.
For your use case it makes a lot of sense to write a searcher to perform these two queries, since your second query depends on the first and you avoid the cost of rendering, HTTP and YQL parsing, which might matter if your client is remote with high network latency.
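As a rough illustration of the two steps such a searcher would chain together, here is a plain-Java model. It is not Vespa code: the class, the in-memory map of documents and all method names are hypothetical. It assumes wand ranks candidates by the dot product between the query's weighted set and the document's weighted set field and keeps the top k, which a real searcher would express with a WandItem instead of a loop.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy model: docs maps a document id to its weighted set (token -> weight).
class TwoStepWand {

    // Step 1: fetch the weighted set stored on the document with this id.
    static Map<String, Integer> lookupWeightedSet(Map<String, Map<String, Integer>> docs, String id) {
        return docs.getOrDefault(id, Map.of());
    }

    // Dot product over the tokens the two weighted sets have in common.
    static long dot(Map<String, Integer> query, Map<String, Integer> doc) {
        long sum = 0;
        for (Map.Entry<String, Integer> e : query.entrySet()) {
            Integer weight = doc.get(e.getKey());
            if (weight != null) {
                sum += (long) e.getValue() * weight;
            }
        }
        return sum;
    }

    // Step 2: score every other document against that set, keep the best k.
    static List<String> topK(Map<String, Map<String, Integer>> docs, String id, int k) {
        Map<String, Integer> query = lookupWeightedSet(docs, id);
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> e : docs.entrySet()) {
            if (!e.getKey().equals(id) && dot(query, e.getValue()) > 0) {
                candidates.add(e.getKey());
            }
        }
        candidates.sort(Comparator.comparingLong((String other) -> dot(query, docs.get(other))).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }
}
```

Both steps run in the same process here, which mirrors the point about avoiding a second HTTP round trip between the lookup and the wand query.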

Custom Searcher - Blending of hits from different sources

We have a need for "Blending of hits from different sources"; as per your documentation, it is recommended to write a custom searcher in Java. Is there a demo of this somewhere on GitHub? I wouldn't even know where to start :( I understand I can create search "chains", preferably asynchronous, and then blend results in Java before returning them, but then how would I handle pagination, limits, etc.? This all seems very complicated for someone who doesn't know Java that much. So I am hoping someone has already written a demo for this. Please? Anyone?
Thank you so much
EDIT to make my question clearer:
We are writing a search engine that fetches data from various websites. Some websites have 10 million indexable items, others only 100,000. When we present the results to the end user, we want to include results from all our sources (when a match applies). Let's say 10 results from each of the websites we crawl, so that they all get an equal amount of attention on the page. If we don't do custom blending, the largest website with the most items wins all our traffic.
I understand that we can send 10 separate queries to Vespa and blend the results in our front end, but that seems very inefficient. Thus the question of a "Custom Searcher". Thank you so much!
That documentation covers some very advanced use cases which you do not have. Are your sources different Vespa schemas or content clusters? If so Vespa will by default blend the hits returned from each according to their relevance scores so there's nothing you need to do.
The two other most common use-cases are:
Some (or all) the data sources are external, so you need to write a Searcher component to fetch the external data and turn it into a Result.
You want the data to be blended in some custom way (rather than by relevance score). If so you need to exclude the default blending Searcher (com.yahoo.prelude.searcher.BlendingSearcher) and write your own.
If you provide some more information about your use cases I can give you some code examples.
EDIT: Use grouping to solve the need explained under "EDIT" in the question:
Create a "siteid" field when feeding (e.g. in document processing).
Use the grouping expression all(group(siteid) each(max(10) output(summary())))
See http://docs.vespa.ai/documentation/grouping.html
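To show what that grouping expression gives you, here is a plain-Java model of "at most N hits per siteid, best first". It is only an illustration of the result shape, not Vespa code; the Hit class and the method name are made up, and Vespa does this inside the content nodes rather than on a list in memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of grouping hits by site and capping each group.
class SiteGrouping {

    // Hypothetical hit: the site it came from, its id and its relevance score.
    static final class Hit {
        final String siteId;
        final String docId;
        final double relevance;

        Hit(String siteId, String docId, double relevance) {
            this.siteId = siteId;
            this.docId = docId;
            this.relevance = relevance;
        }
    }

    // Returns siteId -> up to maxPerSite document ids, highest relevance first.
    static Map<String, List<String>> topPerSite(List<Hit> hits, int maxPerSite) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.relevance).reversed());
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Hit h : sorted) {
            List<String> group = groups.computeIfAbsent(h.siteId, key -> new ArrayList<>());
            if (group.size() < maxPerSite) {
                group.add(h.docId);
            }
        }
        return groups;
    }
}
```

A site with millions of matching items still contributes at most maxPerSite hits, so smaller sites are not crowded out.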

Camel condition on aggregate of messages

I'm looking for a way to conditionally handle messages based on the aggregation of messages. I've looked into a lot of ways to do this, but it seems that Apache Camel doesn't support it. I'll explain the scenario and then the solutions I tried.
Scenario:
I'm trying to conditionally clean a directory. I poll from the directory every x days and fetch all the files (file://...). I route this into an aggregation, that aggregates the files into a single size (directorySize). I then check if this size passes a certain threshold.
Here is where the problem lies. I now want to remove certain files if this condition passes, but I don't have access to the original messages anymore because they were aggregated in a new exchange.
Solutions:
I tried to fetch the files again to process them. Problem is that you can't make a consumer fetch on demand as far as I know. I tried using pollEnrich, but that will only fetch a single file and not all files in the directory.
I tried to filter/stop the parent route. The problem here is that filter()/choice...stop()/end() will only stop the aggregated route with the directory size and not the parent route with the file messages. I can't conditionally process these.
I tried to move the aggregated condition to another route that I would call, but this causes the same problem as the first solution.
Things I consider doing:
Rewrite the aggregation strategy to aggregate not only the size, but also the files themselves into a groupedExchange. This way I can split the aggregation again after the check. I don't really like this solution because it causes a lot of boilerplate, both in code and at runtime.
Move the file size calculator to a processor instead of the aggregator. This would defeat the purpose of using camel in the first place.. I would manually be fetching the files and adding the sizes.. And that for every single file..
Use a ControlBus to dynamically start the delete route on that directory. Once again a lot of workaround to achieve something that I feel should be able to be done in a simple route.
I would like to set the calculated size on every parent message, but I have no clue how this could be achieved?
Another way to stop the parent route that I haven't thought of?
I'm a bit stunned that you can't elegantly filter messages based on the aggregation of these messages. Is there something that I missed in Camel that would provide an elegant solution? Or is this a case of the least bad solution?
Simple Schema
Message(File)
Message(File) --> AggregatedMessage(directorySize) --> delete certain Files?
Message(File)
Camel is really awesome, but sometimes it's sure difficult to see exactly which design pattern to use ;)
Firstly, you need to keep a copy of the file objects, because you don't know whether to delete them or not until you reach your threshold - there are basically (at least) two ways to do this.
Alternative 1
The first way is to use a List in an exchange property. This property will hang around no matter what you do with the exchange body. If you have a look at the source code for GroupedExchangeAggregationStrategy, it does precisely this:
list = new ArrayList<Exchange>();
answer.setProperty(Exchange.GROUPED_EXCHANGE, list);
// ...
list.add(newExchange);
Or you could do the same thing manually on your own exchange property. In any case, it's completely fine to use the Grouped aggregation strategy as you have done.
Alternative 2
The second way to "keep" old messages is to send a copy to a stopped SEDA queue. So you would do to("seda:xyz") and define the route consuming from this queue with .noAutoStartup(). Then you can send messages to it and they will queue up on an internal queue, managed by Camel. When you want to process the messages, you simply start it up via controlbus and stop it again afterwards.
Generally, messing around with starting and stopping queues should be avoided unless absolutely necessary, but that's certainly another way to do it.
Suggested solution
I suggest you do as you have done (i.e. alternative 1):
Aggregate via GroupedExchangeAggregationStrategy to keep the individual files in a list
Compute the total file size (use a processor, or do it along the way with a custom aggregation strategy)
Use a filter(simple("${body} < 123"))
"Unwind" your aggregation via a splitter(simple("${property.CamelGroupedExchange}"))
Delete your files one by one
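The shape of those steps, modelled in plain Java without Camel so the data flow is easy to see. This is an illustration only: FileInfo, filesToDelete and the predicate are made-up names, and it assumes deletion should happen once the total exceeds the threshold, as in the question. In a real route the list would live in the grouped exchange property and the loop would be the splitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of aggregate -> filter -> split for the directory clean-up.
class DirectoryCleanup {

    // Hypothetical stand-in for one file message.
    static final class FileInfo {
        final String name;
        final long size;

        FileInfo(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    static List<String> filesToDelete(List<FileInfo> files,
                                      long thresholdBytes,
                                      Predicate<FileInfo> shouldDelete) {
        // Aggregate: keep every file next to the running total.
        List<FileInfo> grouped = new ArrayList<>();
        long total = 0;
        for (FileInfo file : files) {
            grouped.add(file);
            total += file.size;
        }
        List<String> toDelete = new ArrayList<>();
        // Filter: do nothing unless the directory is over the threshold.
        if (total <= thresholdBytes) {
            return toDelete;
        }
        // Split: unwind the group and pick the individual files to remove.
        for (FileInfo file : grouped) {
            if (shouldDelete.test(file)) {
                toDelete.add(file.name);
            }
        }
        return toDelete;
    }
}
```

The point of the pattern is that the individual files travel together with the total, so the decision made on the aggregate can still be applied to each file afterwards.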
Please let me know if this doesn't make sense, or if I have misunderstood your problem in any way.
