I'm trying to use Solr's MoreLikeThis feature to find documents similar to some other document, but I don't quite understand how some of this functionality works.
As it says here, the MoreLikeThis component works best when the termVectors are stored. And here comes my confusion.
Is it enough to enable the termVectors flag on a field (say, a field containing a movie review text) in Solr's schema.xml file? Will that make Solr calculate the term vectors for the field when a document is inserted, store them, and then use the calculated term vectors in subsequent calls to the MoreLikeThis handler?
The short answer is NO: you need to re-index after such a schema change.
Having the term vector enabled will speed up the first phase, i.e. finding the interesting terms in the original input document (assuming this document is in the index).
The timing of the second phase (when the actual More Like This query happens) will remain the same.
For more information about how MLT works, see [1].
In general, when applying such changes to the schema, you need to re-index your documents to make Solr build the related data structures (the term vector is a mini index per document, and requires specific files to be stored on disk [2]; N.B. this will increase your disk utilisation).
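For illustration, enabling term vectors on a field in schema.xml might look like the sketch below (the field name and type are illustrative, not from the question); documents indexed before the change must be re-indexed for the vectors to exist:

```xml
<!-- Sketch: term vectors enabled on a review-text field (schema.xml).
     termPositions/termOffsets are optional extras; all three increase
     disk usage. Existing documents must be re-indexed. -->
<field name="review_text" type="text_general"
       indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```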
[1] https://www.slideshare.net/AlessandroBenedetti/advanced-document-similarity-with-apache-lucene
[2] https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene50/Lucene50TermVectorsFormat.html
I have a big list of related terms (not synonyms) that I would like my solr engine to take into account when searching. For example:
Database --> PostgreSQL, Oracle, Derby, MySQL, MSSQL, RabbitMQ, MongoDB
For this kind of list, I would like Solr to take into account that if a user searches for "postgresql configuration", results related to "RabbitMQ" or "Oracle" might also be relevant, but not as absolute synonyms: just boost results that contain these keywords/terms.
What is the best approach to implement such connection? Thanks!
You've already discovered that these are synonyms - and that you want to use that metainformation as a boost (which is a good idea).
The key is then to define a field that does what you want - in addition to your regular field. Most of these cases are implemented by having a second field that does the "less accurate" version of the field, and apply a lower boost to matches in that field compared to the accurate version.
You define both fields - one with synonyms (for example content_synonyms) and one without (content), and then add a copyField instruction from the content field (this means that Solr will take anything submitted to the content field and "copy" it as the source text for the content_synonyms field as well).
Using edismax you can then use qf to query both fields and give a higher weight to the exact content field: qf=content^10 content_synonyms will score hits in content 10x higher than hits in content_synonyms, in effect using the synonym field for boosting content.
The exact weights will have to be adjusted to fit your use case, document profile and query profile.
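A minimal sketch of this setup, assuming illustrative field-type names (text_general for the plain field, a text_with_synonyms type that includes a SynonymFilterFactory for the other):

```xml
<!-- schema.xml sketch: one exact field, one synonym-expanded field -->
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="content_synonyms" type="text_with_synonyms" indexed="true" stored="false"/>
<copyField source="content" dest="content_synonyms"/>
```

A query would then weight the exact field higher, e.g. `q=postgresql configuration&defType=edismax&qf=content^10 content_synonyms`.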
I run a query against a Solr core and restrict the result using a filter
like fq={!frange l=0.7}query($q). I'm aware that Solr scores do not
have an absolute meaning, but the 0.7 (just an example) is calculated
based on user input and some heuristics, which works quite well.
The problem is the following: I update quite a few documents in my core.
The updated fields are only meta data fields, which are unrelated to the
above search. But because an update is internally a delete + insert, IDF
and doc counts change. And so do the calculated scores. Suddenly my
query returns different results.
As Yonik explained to me here, this behaviour is by design. So my question is: what is the simplest,
least expensive way to keep the scores and the output of my query stable?
Running optimize after each commit should solve the problem, but I
wonder if there is something simpler and less expensive.
You really need to run optimize. When you optimize the index, Solr purges the deleted documents that are still physically present in the segments, which makes the query stable again. Cleaning up this information every time a document is updated would be expensive, so Solr only does it on optimize. There is an easy way to see whether your index is more or less stable: in the Solr admin API you can see the Num Docs and Max Doc values. If Max Doc is greater than Num Docs, old (deleted) documents are still affecting your relevancy calculation. Optimizing the index makes these two numbers equal again, and once they are equal you can trust that IDF is being calculated correctly.
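As a sketch, both the optimize call and the Num Docs / Max Doc check are plain HTTP requests (the core name `mycore` and host are illustrative):

```shell
# Trigger an optimize (forced merge) after your commits.
curl 'http://localhost:8983/solr/mycore/update?optimize=true'

# Inspect numDocs vs maxDoc via the Luke handler; equal values mean
# no deleted documents remain to skew IDF.
curl 'http://localhost:8983/solr/mycore/admin/luke?numTerms=0'
```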
I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.
So, Could anyone please just explain the difference between a regular Document field (like StringField, TextField, IntField etc.) and DocValues fields
(like IntDocValuesField, SortedDocValuesField (the types seem to have changed in Lucene 5.0) etc.)?
First, why can't I access DocValues using document.get(fieldname)? And if not that way, how can I access them?
Second, I've seen that in Lucene 5.0 some features are changed, for example sorting can only be done on DocValues... why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph
Most of these questions are quickly answered by referring either to the Solr Wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern search service except for the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason for this is that the storage format is turned around from the standard format when gathering data for these operations: where the application previously had to go through each document to find the values, it can now look up the values and find the corresponding documents instead. This is very useful when you already have a list of documents that you need to perform an intersection on.
If I remember correctly, updating a DocValue-based field involves yanking the document out from the previous token list, and then re-inserting it into the new location, compared to the previous approach where it would change loads of dependencies (and reindexing was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.
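To make the distinction concrete, here is a sketch in the Lucene 5.x-era API (field names, and the writer/searcher setup, are illustrative). Note that DocValues are not returned by document.get(); they are read per segment, e.g. via LeafReader.getNumericDocValues(...):

```java
// A document carrying both regular fields and DocValues siblings.
Document doc = new Document();
doc.add(new StringField("id", "42", Field.Store.YES));       // searchable + stored
doc.add(new TextField("body", "some review text", Field.Store.NO)); // full-text searchable
doc.add(new NumericDocValuesField("rating", 5));             // column store: sortable
doc.add(new SortedDocValuesField("idSort", new BytesRef("42"))); // sortable string column
writer.addDocument(doc);

// Sorting reads the DocValues column, not the stored fields.
Sort byRating = new Sort(new SortField("rating", SortField.Type.LONG, true));
TopDocs top = searcher.search(new MatchAllDocsQuery(), 10, byRating);
```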
I use Lucene 4.4 to store users' reading profiles which are represented by word vectors and are stored in a single document field. These profiles are frequently modified: some terms counts need to be incremented or decremented. Is there a better way to update term frequencies than loading the whole document term vector, modifying it and then indexing again?
No, to update a document in Lucene, you must reindex the document. The process can be simplified using a call to updateDocument, but this doesn't simplify the operation on the backend. It still must delete the old document, and index a new one.
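A minimal sketch of that flow with the Lucene 4.x API (the field names, analyzer, and directory setup are illustrative):

```java
// updateDocument deletes any document matching the term, then adds the
// new document -- there is no in-place edit of an indexed field.
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_44, analyzer);
IndexWriter writer = new IndexWriter(dir, cfg);

Document doc = new Document();
doc.add(new StringField("userId", "42", Field.Store.YES));
doc.add(new TextField("profile", rebuiltProfileText, Field.Store.NO));

// Atomically replaces the old profile document for userId=42.
writer.updateDocument(new Term("userId", "42"), doc);
writer.commit();
```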
This is a bit of a weird one and I'm not sure Solr can do it. I have a collection of documents from differing sources; some are time sensitive and some are evergreen. I'd like to be able to give the user results that contain both. Right now I'm boosting the score of newer documents as described here, but that means the evergreen docs don't show up as much as I'd like.
I'd like to be able to include a factor in the boost that modifies it according to the class of document. In other words time sensitive docs would get one boost value based on age and evergreen ones would get a different boost or none at all.
Is there any way to tell solr not to apply the time boost to some docs?
Why don't you index evergreen documents with an index-time, document-level boost? Then, if they match at all, they will have that boost combined with the query-time boost.
You can apply that boost in the update format (XML or JSON), in DataImportHandler with $docBoost, or in an UpdateRequestProcessor if there is a specific field to check that marks a document as evergreen.
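As a sketch, a document-level boost in the XML update format looks like this (the field names and the boost value are illustrative; index-time boosts apply to older Solr versions, as they were later removed):

```xml
<add>
  <!-- boost attribute on <doc> sets an index-time, document-level boost -->
  <doc boost="2.0">
    <field name="id">evergreen-001</field>
    <field name="doctype">evergreen</field>
    <field name="title">How DNS works</field>
  </doc>
</add>
```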