Lucene comparing document contents

I am trying to compare the contents of documents using Solr. I do this by simply using the entire contents of a document as a query. This works until the documents get large: a document can contain 15k words or more, and the query then fails with a max boolean clauses exception (the limit defaults to 1024). I could of course increase the limit, but even at 5k it would still be impossible to compare documents with large contents.
Is Lucene even suitable for such tasks? If so, what should I do to meet these requirements? If not, what would be an alternative way of comparing the contents of one document with other documents?

I think MoreLikeThis is what you want. MoreLikeThis prunes a document's contents to its higher-frequency terms and searches with just those, which gets around the high number of terms (and improves performance). If you are searching for documents similar to an external source:
MoreLikeThis mlt = new MoreLikeThis(indexreader);
Query query = mlt.like(someReader, "contents");
// Hits was removed in Lucene 3.0; search(Query, int) returns the top n matches (10 is arbitrary here)
TopDocs hits = indexsearcher.search(query, 10);
Or if searching for a document already in the index:
MoreLikeThis mlt = new MoreLikeThis(indexreader);
Query query = mlt.like(documentNumber);
TopDocs hits = indexsearcher.search(query, 10);
Solr also includes a MoreLikeThis handler.
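The pruning idea can be illustrated without Lucene. The following is only a minimal sketch, assuming plain frequency counting; the class TopTermsSketch and its methods are hypothetical names, and the real MoreLikeThis applies further criteria (such as document frequency and stop words) that are left out here. It shows how a 15k-word document can be reduced to a query that stays well under the boolean clause limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: keeps only the most frequent terms of a text.
final class TopTermsSketch {

    // Returns up to maxTerms distinct terms, most frequent first (ties broken alphabetically).
    static List<String> topTerms(String text, int maxTerms) {
        Map<String, Integer> counts = new HashMap<>();
        // Naive tokenization: lowercase, split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return terms.subList(0, Math.min(maxTerms, terms.size()));
    }

    // Joins the selected terms into a single OR-style clause against one field.
    static String toQuery(String field, List<String> terms) {
        return field + ":(" + String.join(" ", terms) + ")";
    }

    public static void main(String[] args) {
        String text = "solr lucene solr index lucene solr query";
        System.out.println(toQuery("contents", topTerms(text, 2)));
    }
}
```

With maxTerms set to a few dozen, the resulting query has a bounded number of clauses no matter how long the source document is.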

Related

Getting a certain number of docs from Solr

I need to get only the first n documents from Solr, sorted by the prevId field (that is, not all the documents, but only as many as the rows value). It seems to perform poorly, and moreover it returns the wrong number of found docs. Is there any way to do this from the Solr GUI or with a raw request?

numFound is the total number of documents in the index that match your query (which in this case is all the documents in the index). It is not the number of documents returned.
You can enable docValues on your field if sorting is slow for that field - but caching usually helps a lot when doing multiple sorts (as long as your index hasn't been modified in between). That being said, your query took 285ms on the Solr side, so maybe the slowness you're experiencing comes from somewhere else than Solr?
Different output formats (&wt=json etc.) might also be more efficient to deserialize in your language of choice (and to display in your browser, which does a lot of syntax highlighting for XML).
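As an illustration of the raw request, here is a minimal sketch that assembles the URL from the standard q, sort, rows and wt parameters. The host, port and core name (localhost:8983, mycore) are assumptions, and FirstNDocsRequest is a hypothetical helper name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a select URL that returns only the first n documents by one sort field.
final class FirstNDocsRequest {

    static String build(String baseUrl, String sortField, int n) {
        try {
            return baseUrl + "/select"
                + "?q=" + URLEncoder.encode("*:*", "UTF-8")                 // match all documents
                + "&sort=" + URLEncoder.encode(sortField + " asc", "UTF-8") // ascending sort
                + "&rows=" + n                                              // number of docs returned
                + "&wt=json";                                               // JSON response format
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // The base URL and core name are assumptions for illustration.
        System.out.println(build("http://localhost:8983/solr/mycore", "prevId", 10));
    }
}
```

The response would still report numFound for the whole match set, while the docs array would hold at most rows entries.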

Can changes in synonyms.txt file take effect without reindex?

We are using Sunspot-solr 4.0. When I update the synonyms file, nothing changes in search. Do I really need to reindex after making changes to synonyms.txt, or is there some other trick for updating the synonyms file that I am missing?

That depends on when you're expanding the synonyms. If you're expanding at query time, the updates will be visible without any reindexing, but if you're expanding at index time (which is the recommended way), you'll have to reindex to get the new synonyms included in the index.
The reasoning behind recommending expansion at index time compared to query time is described in the old wiki:
This is because there are two potential issues that can arise at query time:
The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
Phrase searching (i.e. "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occurring in a document.
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Television and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score than docs that match "TV" comparably -- which may be somewhat counterintuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
There's a really detailed explanation of what's actually happening behind the scenes available in Better synonym handling in Solr.
As long as you're aware of these issues and the trade-off, doing query time synonyms could work fine - but you'll have to test it against your queries and what you expect the results to be - and be aware of the pitfalls.
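To make the idf effect in the TV/Television scenario concrete, here is a small worked sketch using the idf formula of Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The document counts are assumptions chosen to match "many thousands" and "a few hundred", and IdfSketch is a hypothetical name.

```java
// Hypothetical worked example; the document counts below are illustrative assumptions.
final class IdfSketch {

    // idf as computed by Lucene's classic TF-IDF similarity.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 100_000;        // assumed index size
        long tvDocs = 20_000;          // "many thousands" of documents containing TV
        long televisionDocs = 300;     // "a few hundred" documents containing Television
        System.out.println("idf(TV)         = " + idf(tvDocs, numDocs));
        System.out.println("idf(Television) = " + idf(televisionDocs, numDocs));
    }
}
```

Under these assumed counts the rarer term gets a clearly larger idf, which is why documents matching only "Television" can outrank documents matching "TV" when the expansion happens at query time.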

Relevance feedback in Apache Solr

I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the stream.url GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
This is a pretty bad implementation: Treating the set of input documents like one big document discriminates against short documents because short documents occupy a small portion of the entire big document.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?

Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it with a (dedicated) SearchHandler.
I've already done something similar, and it is quite easy if you look at the original implementation of the MLT component.
The most difficult part is synchronizing the results from different shard servers, but that can be skipped if you do not use shards.
I would also strongly recommend using your own parameter names in your implementation to prevent collisions with other components.
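The term selection the question asks for (top n terms per input document, boosted by TF-IDF) can be sketched independently of Solr. This is only an illustration under stated assumptions: MoreLikeThoseSketch is a hypothetical class, the term and document frequencies are passed in as plain maps rather than read from an index, and boosts of terms selected from several documents are simply summed. A real component would also have to escape special query characters in the terms.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "More Like Those" term selection; not Solr's MLT implementation.
final class MoreLikeThoseSketch {

    // TF-IDF with the classic idf formula 1 + ln(numDocs / (docFreq + 1)).
    static double tfIdf(int tf, int docFreq, int numDocs) {
        return tf * (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }

    // For each input document (term -> term frequency), keeps its topN terms by TF-IDF
    // and sums the boosts of terms that are selected from more than one document.
    static Map<String, Double> selectTerms(List<Map<String, Integer>> docs,
                                           Map<String, Integer> docFreqs,
                                           int numDocs, int topN) {
        Map<String, Double> boosts = new TreeMap<>();
        for (Map<String, Integer> doc : docs) {
            List<Map.Entry<String, Double>> scored = new ArrayList<>();
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                int df = docFreqs.getOrDefault(e.getKey(), 1); // assume df = 1 when unknown
                scored.add(new AbstractMap.SimpleEntry<String, Double>(
                        e.getKey(), tfIdf(e.getValue(), df, numDocs)));
            }
            scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
            for (Map.Entry<String, Double> e : scored.subList(0, Math.min(topN, scored.size()))) {
                boosts.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return boosts;
    }

    // Renders the selected terms as boosted clauses, e.g. text:solr^16.53
    static String toQuery(String field, Map<String, Double> boosts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(e.getKey())
              .append('^').append(String.format(Locale.ROOT, "%.2f", e.getValue()));
        }
        return sb.toString();
    }
}
```

Because each document contributes at most topN terms, a short document gets the same number of slots as a long one, which avoids the bias of the concatenation approach.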

How can I limit my Solr search to an arbitrary set of 100,000 documents?

I've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search and it will be rare enough to call it "never" that any two searches will have the same set of FLRIDs to limit on.
What we're doing right now is, roughly:
q=title:dogs AND
(flrid:(123 125 139 .... 34823) OR
flrid:(34837 ... 59091) OR
... OR
flrid:(101294813 ... 103049934))
Each of those parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limit on the number of terms that can be ORed together.
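For reference, the subgrouping described above could be generated along these lines. This is a minimal sketch of the current approach only, not a faster alternative; ChunkedIdFilter is a hypothetical name and the chunk size is a parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds "(flrid:(1 2) OR flrid:(3 4) OR ...)" from a list of IDs.
final class ChunkedIdFilter {

    static String build(String field, List<Long> ids, int chunkSize) {
        if (ids.isEmpty() || chunkSize <= 0) {
            throw new IllegalArgumentException("need at least one id and a positive chunk size");
        }
        List<String> groups = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += chunkSize) {
            // Each group holds at most chunkSize IDs, keeping it under the clause limit.
            List<Long> chunk = ids.subList(start, Math.min(start + chunkSize, ids.size()));
            StringBuilder group = new StringBuilder(field).append(":(");
            for (int i = 0; i < chunk.size(); i++) {
                if (i > 0) {
                    group.append(' ');
                }
                group.append(chunk.get(i));
            }
            groups.add(group.append(')').toString());
        }
        return "(" + String.join(" OR ", groups) + ")";
    }
}
```

The full query is then title:dogs AND followed by the returned string.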
The problem with this approach (besides being clunky) is that it seems to perform at O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. With 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it to be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.
How can we do this better?
Things we've tried or considered:
Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
What we're hoping for:
An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching.
A way to create a match vector that gets passed to the query, because strings of fqs in the query seem to be a suboptimal way to do it.
I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.
solr search within subset defined by list of keys
Searching within a subset of data - Solr
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important because the hardware I run Solr on is not very powerful, and trying to boost at query time fails with an OutOfMemory exception. (Even if I could work around that by increasing the memory available to Java, I prefer to be on the safe side and build the index in the most efficient way possible.)

Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.
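A minimal sketch of that idea follows. The boost formula (1 / sqrt(length), so shorter text gets a larger boost) is an assumption chosen for illustration, BoostedDocXml is a hypothetical helper, and the field names id and text are placeholders. Note that this relies on a Solr version that still accepts index-time boost attributes in the XML update format.

```java
import java.util.Locale;

// Hypothetical helper: emits a Solr XML <add> message with a length-dependent document boost.
final class BoostedDocXml {

    // Assumed formula: the shorter the text, the larger the boost.
    static double boostFor(String text) {
        return 1.0 / Math.sqrt(Math.max(1, text.length()));
    }

    // Escapes the characters that are significant in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String toAddXml(String id, String text) {
        return String.format(Locale.ROOT,
            "<add><doc boost=\"%.3f\"><field name=\"id\">%s</field>"
                + "<field name=\"text\">%s</field></doc></add>",
            boostFor(text), escape(id), escape(text));
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("1", "ABCDEBCE"));
        System.out.println(toAddXml("2", "ABCD"));
    }
}
```

With the two items from the question, ABCD ends up with the larger document boost.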

You can look at the DIH special commands, which include a $docBoost command:
$docBoost : Boost the current doc. The value can be a number or the toString of a number
However, there seems to be no $fieldBoost command.
In your case, though, if you are using DefaultSimilarity, shorter fields are already boosted higher than longer fields in the score calculation.
You can also implement your own Similarity class, with the TF (term frequency) and lengthNorm calculations changed to suit your needs.
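To see what such a change would do, here is a small arithmetic sketch of the two factors involved. It uses the DefaultSimilarity formulas tf = sqrt(freq) and lengthNorm = 1 / sqrt(numTerms), and compares them with a hypothetical "flat" tf that ignores repeated matches. LengthNormSketch is an illustrative class, not a Lucene Similarity subclass; the term counts (8 terms with 2 matches versus 4 terms with 1 match) are assumptions modeled on the ABCDEBCE / ABCD example, and idf, boosts and the lossy norm encoding are ignored.

```java
// Hypothetical arithmetic sketch of the tf and lengthNorm factors; not a Lucene Similarity.
final class LengthNormSketch {

    // DefaultSimilarity: tf grows with the square root of the match count.
    static double defaultTf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed alternative: any number of matches counts the same as one.
    static double flatTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    // DefaultSimilarity: shorter fields get a larger norm.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    static double score(double tf, int numTerms) {
        return tf * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Long field: 8 terms, 2 matches. Short field: 4 terms, 1 match.
        System.out.println("default tf, long:  " + score(defaultTf(2), 8));
        System.out.println("default tf, short: " + score(defaultTf(1), 4));
        System.out.println("flat tf, long:     " + score(flatTf(2), 8));
        System.out.println("flat tf, short:    " + score(flatTf(1), 4));
    }
}
```

Under these assumptions the default tf lets the second match roughly cancel the length penalty, while the flat tf leaves the shorter field clearly ahead.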