How to get full documents via MoreLikeThis search in solr? - solr

I'm quite new to MoreLikeThis search in Solr, and I find that one option seems to be missing.
The wiki pages and searches on Google and Stack Overflow say nothing about the document format of the results returned by an MLT search.
My aim is to get either all fields or at least a specified set of fields in the returned documents, but it seems that one has no influence over which fields are included in the similar documents.
Of course, one can run a query for each document in the MoreLikeThis result to get those fields, but I don't like the idea of running multiple queries where a single one should be sufficient.
I would really appreciate it if anybody knows a way to influence the result format of the documents.
Thanks.
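One thing that may be worth trying: as far as I know, the standard fl parameter also governs which fields come back for the similar documents, while mlt.fl only selects the fields used to compute similarity. The sketch below just assembles such a request in Python. The field names, the id:42 query, and the helper name are hypothetical, and the behaviour should be checked against your Solr version.

```python
from urllib.parse import urlencode

def build_mlt_params(doc_query, similarity_fields, return_fields, count=5):
    """Assemble parameters for a search that also asks for MoreLikeThis results.

    Assumption: `fl` controls the fields returned for the similar documents,
    and `mlt.fl` only controls which fields are compared.
    """
    return {
        "q": doc_query,
        "mlt": "true",
        "mlt.fl": ",".join(similarity_fields),  # fields used for similarity
        "mlt.count": str(count),                # similar docs per result
        "fl": ",".join(return_fields),          # fields to return
        "wt": "json",
    }

# Hypothetical field names, for illustration only.
params = build_mlt_params("id:42", ["title", "body"], ["id", "title", "score"])
query_string = urlencode(params)
```

The resulting query string would be appended to the select handler URL of your core.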

Related

Solr MoreLikeThis find documents that are near identical

I have an index of documents that are basically scraped website content. I need to be able to serve documents that are nearly identical. This requirement arises when one website copies content from another website. They do change some words, but the text is mostly 80% - 90% the same, and I need to group such content, that is, find its near duplicates. So the requirement is to find and group documents that are more than 75% similar to one another.
I was experimenting with Solr MLT, and I'm pleased with overall results, but I can't find a nice and efficient way to get normalized results.
The closest I got to the result I need is to send the document content via stream.body (that document is already in the index) to the MLT /mlt request handler and then see what score is returned for the same document that is already indexed. With that I can calculate how similar the other documents are.
But this seems to be very wasteful of resources and I feel that there has to be a better way to achieve this task.
So my question is: can MLT produce such results, or am I stretching what MLT can achieve?
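The normalization described above amounts to dividing each candidate's score by the score the source document gets when matched against itself. A minimal sketch of that arithmetic, assuming the (id, score) pairs have already been pulled out of one MLT response; the function name and the 0.75 default are illustrative, and a score ratio is only a heuristic, not a true percentage of shared text:

```python
def near_duplicates(scored_docs, source_id, threshold=0.75):
    """Return (doc_id, normalized_score) pairs whose ratio exceeds threshold.

    scored_docs: list of (doc_id, score) tuples from a single MLT response
                 that includes the source document itself.
    """
    scores = dict(scored_docs)
    self_score = scores.get(source_id)
    if not self_score:
        raise ValueError("source document missing from results or has zero score")

    result = []
    for doc_id, score in scored_docs:
        if doc_id == source_id:
            continue  # skip the self-match used as the reference score
        ratio = score / self_score
        if ratio > threshold:
            result.append((doc_id, ratio))
    return result
```

For example, with scores of 4.0 for the source, 3.5 for one candidate, and 2.0 for another, only the first candidate (ratio 0.875) passes the 0.75 threshold.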

Solr queries stored within Solr field

I have a set of keywords defined by client requirements stored in a SOLR field. I also have a never ending stream of sentences entering the system.
By using the sentence as the query against the keywords I am able to find those sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way in which SOLR is normally used by storing the query in Solr and passing the text in as the query.
Now I would like to be able to extend the idea of having just a keyword in a field to having a more fully formed SOLR query in a field. Doing so would allow proximity searching etc. But, of course, this is where life becomes awkward. Placing SOLR query operators into a field will not work as they need to be escaped.
Does anyone know if it might be possible to use the SOLR "query" function or perhaps write a java class that would enable such functionality? Or is the idea blowing just a bit too much against the SOLR winds?
Thanks in advance.
ES has percolate for this - for Solr you'll usually index the document as a single document in a memory based core / index and then run the queries against that (which is what ES at least used to do internally, IIRC).
I would check out the percolate api with ElasticSearch. It would sure be easier using this api than having to write your own in Solr.

Similarity/approximate queries in Solr

What is the simplest way to query Solr for documents that contain text similar to a (longish) passage? This is similar to what ElasticSearch match queries do or what probabilistic search engines like Indri do by default. It is something between an AND query and an OR query: none of the terms is required, but you get documents that contain many of the terms. You can also just pass a passage of raw text to the engine, and it returns documents with high term overlap with the passage without the client having to parse or tokenize the text. The best option I can see in the Solr query reference is to tokenize the query text myself, insert an OR between each pair of terms, and return the top N results. Is there a more concise way of doing it with Solr?
The answer above is correct. You can choose to find documents similar to another document in the index, similar to a given external URL or similar to some given text. You can choose what field(s) to target and various other parameters. Here's the official Solr Reference Guide documentation page for MLT: https://cwiki.apache.org/confluence/display/solr/MoreLikeThis
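For comparison, the client-side fallback mentioned in the question (tokenize the passage, join the terms with OR) could look roughly like the sketch below. The field name text, the term cap, and the simple \w+ tokenizer are assumptions; the tokenizer drops punctuation, so no query-syntax escaping is attempted here.

```python
import re

def passage_to_or_query(passage, field="text", max_terms=50):
    """Turn raw text into a query of the form field:(t1 OR t2 OR ...).

    Tokens are lowercased and de-duplicated in order of first appearance.
    """
    terms = []
    seen = set()
    for token in re.findall(r"\w+", passage.lower()):
        if token not in seen:
            seen.add(token)
            terms.append(token)
    if not terms:
        raise ValueError("passage contains no searchable terms")
    # Cap the number of clauses so very long passages stay manageable.
    return f"{field}:(" + " OR ".join(terms[:max_terms]) + ")"
```

The returned string would go into the q parameter, with rows set to the N results wanted.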

Apache Solr Date Searching

I am working on two different searching tools: DtSearch and Solr. I do a FULL_TEXT search on one indexed search term ("2008/12/02") and unfortunately both give different hits though the data are the same. Another strange thing I notice is that Solr gives three DOC_ID as hits and DtSearch gives me five for the same search terms.
I am confused about date searching now. How can it be possible though the data are the same?
Do I need to apply some extra settings in config files? Is there any way I get consistent output?
Thank you,

How do I retrieve all applicable facet fields for a Solr search

I'm trying to use Solr for faceted search on a website.
When a user fires off a search query, I query Solr and retrieve the search results which can then be displayed.
My question is - how do I find out which facet fields and terms are applicable to the search results?
To be clear - different categories of products have different facet fields and I want to find a way to bring back the most relevant facet fields for the search results that have been returned. I don't want to have to specify the fields - I'd like Solr to identify the relevant ones for me.
Thanks in advance!
I would recommend looking over all of the Simple Facet Parameters on the Solr Wiki, especially the examples at the bottom as they will show you all of the possible ways that you can configure the faceting results for your queries.
If I am understanding your question correctly: by default, faceting only brings back facets and counts based on the documents in the result set. However, to make those more relevant to the search, you should set facet.mincount to something other than the default value of 0, e.g. &facet.mincount=1. But again, please refer to the documentation on how this works and how it can be applied to your scenario.
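As a rough illustration of that suggestion, here is how such a request's query string might be assembled in Python. The field names brand and color are hypothetical. Note that the facet fields still have to be listed explicitly; facet.mincount=1 only hides the terms that have no matches in the current result set.

```python
from urllib.parse import urlencode

def build_facet_query(user_query, facet_fields, mincount=1):
    """Build a Solr query string that facets on the given fields."""
    params = [
        ("q", user_query),
        ("facet", "true"),
        ("facet.mincount", str(mincount)),  # drop zero-count facet terms
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    return urlencode(params)

# Hypothetical field names, for illustration only.
example = build_facet_query("laptop", ["brand", "color"])
```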
I'm having the same problem.
What I eventually did was to query Solr for the top 50 hits for a given query and then collect the names of the properties set on those products. I then do another query with the facet fields set to the product properties I found first time around.
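The first pass of that two-query approach could be sketched like this: take the documents from the top hits (already parsed into dictionaries) and collect the field names that actually occur, most frequent first. The function name and the excluded fields are assumptions for illustration.

```python
def candidate_facet_fields(docs, exclude=("id", "score")):
    """Collect field names present in the top hits, most common first.

    docs: list of dicts, one per document returned by the first query.
    """
    counts = {}
    for doc in docs:
        for name in doc:
            if name not in exclude:
                counts[name] = counts.get(name, 0) + 1
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(counts, key=lambda name: (-counts[name], name))
```

The returned names would then be passed as facet.field values in the second query.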
