I have a chemical search application that runs a molecular search using a standard molecule matching engine and retrieves the IDs of the matching chemical structures, along with each hit's score (similarity value) from the engine.
My application then queries SOLR with the list of IDs retrieved from the engine. I want to add each hit's score to the results.
1. Can I simply add this calculated field to SOLR's results? How?
2. Could I implement a SIMILARITY function to supply it as the score instead of the score created by Lucene?
3. I want to order the results by that score. Since the molecule search determines the order, can I tell SOLR to retain the order of the IDs passed in the search query?
We are using SOLR 3.5. It is part of a stack provided by our vendor, so we cannot simply upgrade it.
I'm thinking of implementing a custom search handler that runs the molecule pre-search and then queries SOLR with the output.
I am very new to SOLR and any help would be appreciated.
If you send IDs into Solr and then sort by those same IDs, what do you actually need Solr for? Or are you sub-selecting from those IDs afterwards using a Solr query?
In any case, if your implementation allows you to change solrconfig.xml, you should be able to sneak in a custom Request Handler, which should let you build your pre- and post-processing. Here is one somewhat relevant article.
Regarding custom similarity, I am not sure you mean what you think you mean (a custom Request Handler is a higher-level intercept). However, if you do mean it, the Solr Wiki discusses what is possible before and after Solr 4.
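If the ordering and the extra score only matter to the calling application, one option that needs no Solr customisation is to do the merge on the client side. The following is a minimal sketch of that idea, not a Solr Request Handler: the function names, the `engine_score` key and the shape of the engine hits (a list of `(id, score)` pairs) are assumptions made for illustration, and the IDs are assumed to be simple tokens that need no query escaping.

```python
def build_id_query(ids, id_field="id"):
    """Build a query string such as id:(7 OR 3) from the engine's IDs.

    Assumes the IDs are plain tokens (no spaces or special characters).
    """
    return f"{id_field}:({' OR '.join(str(i) for i in ids)})"


def merge_engine_hits(engine_hits, solr_docs, id_field="id"):
    """Return Solr docs ordered by engine score, each annotated with that score.

    engine_hits: list of (id, score) pairs from the molecule engine (assumed shape).
    solr_docs:   list of dicts, e.g. the "docs" array of a Solr JSON response.
    """
    docs_by_id = {str(doc[id_field]): doc for doc in solr_docs}
    ranked = sorted(engine_hits, key=lambda hit: hit[1], reverse=True)
    merged = []
    for hit_id, score in ranked:
        doc = docs_by_id.get(str(hit_id))
        if doc is None:
            continue  # Solr did not return this ID (filtered out or unknown)
        merged.append({**doc, "engine_score": score})
    return merged
```

The Solr relevance score is simply ignored here; Solr is used only to fetch (and optionally filter) the documents for the IDs the engine returned.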
I have managed to create a dataset using Apache Solr. I have also managed to make queries, such as in this example:
content:(test1 OR test2) OR title: test2
I would now like to search the dataset using an entire string, in a similar fashion to searching on Google. Is the correct approach to keep using OR clauses on the title and content for each word in the query, or is there a better way to achieve this? (I am not looking for exact matches, just the most relevant ones.)
You can use dismax or edismax for this, and you can also pass phrases, if you have any, with boosting.
The DisMax query parser is designed to process simple phrases (without
complex syntax) entered by users and to search for individual terms
across several fields using different weighting (boosts) based on the
significance of each field. Additional options enable users to
influence the score based on rules specific to each use case
(independent of user input).
The parameters are described in detail on the Solr DisMax documentation page.
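As a rough illustration, the request parameters for such a query could be assembled like this. `defType`, `q`, `qf` and `pf` are standard (e)dismax parameters; the helper name, the field names `title` and `content`, and the boost values are assumptions that would need tuning for a real index.

```python
from urllib.parse import urlencode


def edismax_params(user_text, field_boosts, phrase_boosts=None):
    """Build request parameters for a free-text edismax query.

    field_boosts / phrase_boosts map field names to boost factors.
    """
    params = {
        "defType": "edismax",
        "q": user_text,  # the raw user input, no manual OR clauses needed
        "qf": " ".join(f"{field}^{boost}" for field, boost in field_boosts.items()),
    }
    if phrase_boosts:
        # pf boosts documents where the query terms appear close together as a phrase
        params["pf"] = " ".join(
            f"{field}^{boost}" for field, boost in phrase_boosts.items()
        )
    return params


params = edismax_params("solr relevance tuning", {"title": 2, "content": 1}, {"title": 5})
query_string = urlencode(params)  # append to the core's /select URL
```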
I have a big list of related terms (not synonyms) that I would like my solr engine to take into account when searching. For example:
Database --> PostgreSQL, Oracle, Derby, MySQL, MSSQL, RabbitMQ, MongoDB
For this kind of list, I would like Solr to take into account that if a user searches for "postgresql configuration", the results might also include documents related to "RabbitMQ" or "Oracle", but without treating these as absolute synonyms. I just want to boost results that contain these keywords/terms.
What is the best approach to implement such connection? Thanks!
You've already discovered that these are synonyms - and that you want to use that metainformation as a boost (which is a good idea).
The key is then to define a field that does what you want, in addition to your regular field. Most of these cases are implemented by having a second field that holds the "less accurate" version of the field, and applying a lower boost to matches in that field than to matches in the accurate version.
You define both fields, one with synonyms (for example content_synonyms) and one without (content), and then add a copyField instruction from the content field (this means that Solr will take anything submitted to the content field and "copy" it as the source text for the content_synonyms field as well).
Using edismax you can then use qf to query both fields and give a higher weight to the exact content field: qf=content^10 content_synonyms will score hits in content 10x higher than hits in content_synonyms, in effect using the synonym field for boosting content.
The exact weights will have to be adjusted to fit your use case, document profile and query profile.
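To make the effect of the two fields concrete, here is a toy model of the idea in plain Python. It is not how Solr or Lucene compute scores; it only mimics "exact matches count 10x, matches via the related-term list count 1x". The group of terms comes from the question, while the function names and the sample texts are made up for illustration.

```python
# One group of related terms, taken from the example in the question.
GROUPS = [
    {"postgresql", "oracle", "derby", "mysql", "mssql", "rabbitmq", "mongodb"},
]


def expand(tokens):
    """Add every member of a group as soon as one of its members is present."""
    expanded = set(tokens)
    for group in GROUPS:
        if expanded & group:
            expanded |= group
    return expanded


def toy_score(query, text, exact_weight=10, related_weight=1):
    """Weighted count of exact and related-term matches (illustrative only)."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(text.lower().split())
    exact = len(query_tokens & doc_tokens)
    related = len((expand(query_tokens) - query_tokens) & doc_tokens)
    return exact_weight * exact + related_weight * related


q = "postgresql configuration"
exact_doc = toy_score(q, "postgresql configuration guide")  # both terms match exactly
related_doc = toy_score(q, "oracle configuration guide")    # one exact, one related match
```

A document that only mentions a related product still gets a small lift, but it cannot outrank a document that matches the user's actual terms.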
I'm trying to use Solr's MoreLikeThis feature to find documents similar to a given document, but I don't quite understand how some of this functionality works.
As it says here, the MoreLikeThis component works best when termVectors are stored. And here comes my confusion.
Is it enough to enable the termVectors flag on a field (let's say the field contains a movie review text) in Solr's schema.xml file? Will that make Solr calculate the termVectors for the field after a document is inserted, store them, and then use the calculated termVectors in subsequent calls to the MoreLikeThis handler?
The short answer is no: you need to re-index after such a schema change.
Having term vectors enabled will speed up the process of finding the interesting terms in the original input document (if this document is in the index).
The timing of the second phase (when the More Like This query runs) will remain the same.
For more information about how MLT works, see [1].
In general, when applying such changes to the schema, you need to re-index your documents so that Solr builds the related data structures (the term vector is a mini index per document and requires specific files to be stored on disk [2]).
N.B. This will increase your disk utilisation.
[1] https://www.slideshare.net/AlessandroBenedetti/advanced-document-similarity-with-apache-lucene
[2] https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene50/Lucene50TermVectorsFormat.html
I am not sure whether this is a suitable question to post here, but I want to understand whether auto-suggestion is a suitable option for location-based search, as I have a specific requirement. The requirement is: from a specified geo location, search for providers (doctors with a specialty, or hospitals) using auto-suggestion.
As part of the suggestion request, I need to pass the geo location along with the search key. The search key would be a doctor's name, a doctor's specialty, a hospital name or a hospital address, and the suggester would return the results ordered by geo distance, ascending.
The weight would be calculated from the distance by taking its inverse.
I posted an earlier question here (solr autosuggestion with tokenization); this post is related to it.
Regards
Venkata Madhu
If you want to add more logic to the suggestions you are going to show, it is probably a good idea to use normal queries instead of the suggest component.
For instance, take a look at this repo, which is a (somewhat outdated) example of using a normal Solr core to store suggestions and run suggest-like queries. This means you can do partial-match queries on that index and add whatever custom scoring logic you want. Keep in mind that it doesn't need to be a separate core; you could just copy data from the fields you already have into a separate field used only for generating the suggestions.
In this case, you only need to add to or edit the score function to include your own logic (geodist), or even do a hard sort on the distance.
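The inverse-distance weighting can be sketched outside Solr to see how it behaves. In the snippet below, `recip` has the same shape as Solr's `recip(x,m,a,b)` function query (`a / (m*x + b)`), and the haversine distance stands in for what `geodist()` would compute inside Solr; the provider names, coordinates and the constants passed to `recip` are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def recip(x, m, a, b):
    """Same formula as Solr's recip(x,m,a,b): a / (m*x + b)."""
    return a / (m * x + b)


def rank_suggestions(candidates, lat, lon):
    """Annotate candidates with distance and weight, nearest first (hard sort)."""
    scored = []
    for candidate in candidates:
        distance = haversine_km(lat, lon, candidate["lat"], candidate["lon"])
        scored.append({**candidate, "distance_km": distance, "weight": recip(distance, 1, 1, 1)})
    return sorted(scored, key=lambda c: c["distance_km"])


# Hypothetical providers that already matched the typed prefix.
providers = [
    {"name": "Far Hospital", "lat": 13.08, "lon": 80.27},
    {"name": "Near Clinic", "lat": 12.98, "lon": 77.60},
]
ranked = rank_suggestions(providers, 12.97, 77.59)
```

Sorting on the distance gives the strict nearest-first order from the question, while the `weight` value is what you would feed into the score if distance should only be one factor among several.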
I have a set of keywords defined by client requirements stored in a SOLR field. I also have a never ending stream of sentences entering the system.
By using the sentence as the query against the keywords I am able to find those sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way in which SOLR is normally used by storing the query in Solr and passing the text in as the query.
Now I would like to be able to extend the idea of having just a keyword in a field to having a more fully formed SOLR query in a field. Doing so would allow proximity searching etc. But, of course, this is where life becomes awkward. Placing SOLR query operators into a field will not work as they need to be escaped.
Does anyone know if it might be possible to use the SOLR "query" function or perhaps write a java class that would enable such functionality? Or is the idea blowing just a bit too much against the SOLR winds?
Thanks in advance.
ES has percolate for this - for Solr you'll usually index the document as a single document in a memory based core / index and then run the queries against that (which is what ES at least used to do internally, IIRC).
I would check out the percolate api with ElasticSearch. It would sure be easier using this api than having to write your own in Solr.
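The "reverse search" idea itself is small enough to sketch in plain Python: keep the queries in memory and run each of them against every incoming sentence. This is only a toy model of the concept, not the Elasticsearch percolate API or anything Solr provides; the stored queries, their names and the `near` helper are invented for the example, and `near` uses a simplified notion of proximity (at most `max_gap` words in between, in either order) rather than Lucene's slop semantics.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def near(tokens, first, second, max_gap):
    """True if `first` and `second` occur with at most `max_gap` words between them."""
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        abs(i - j) - 1 <= max_gap
        for i in first_positions
        for j in second_positions
    )


# Stored "queries": each one is a predicate over the token list of a sentence.
STORED = {
    "mentions_solr": lambda toks: "solr" in toks,
    "proximity_search": lambda toks: near(toks, "proximity", "search", 2),
}


def percolate(sentence, stored):
    """Return the names of all stored queries that match the sentence."""
    tokens = tokenize(sentence)
    return sorted(name for name, matches in stored.items() if matches(tokens))


matched = percolate("Solr supports proximity and phrase search", STORED)
```

With a real engine, the predicates would be replaced by parsed queries and the token list by a one-document in-memory index, which is the approach the answer above describes.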