How to extract BM25 score from Solr

For a document auto-tagging system, we would like to apply tags based on Solr's BM25 measure.
Our algorithm should work like this:
indexed documents with tags already applied are stored in Solr
new documents without tags are posted => apply tags based on the nearest neighbor of the new document (as far as I know, the document with the best BM25 score)
So my questions:
Is this feasible? Can I extract the BM25 score from Solr? This might require first indexing the new document, finding its nearest neighbor and that neighbor's tags, then deleting the new document and re-indexing it with the tags applied.
Is this, in general, a good idea?
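The score itself can be read back from Solr by requesting the score pseudo-field (BM25 is the default similarity since Solr 6; earlier versions score with TF-IDF unless configured otherwise), so one way to sketch this without the index/delete/re-index round trip is to run the new document's text as a query against the already-tagged documents. A minimal sketch in Python with the requests library, assuming a core named "docs", a full-text field "content", and a multiValued "tags" field (all of these names are made up for illustration):

import requests

SELECT = "http://localhost:8983/solr/docs/select"

def suggest_tags(new_doc_text, rows=1):
    params = {
        "q": new_doc_text,        # the untagged document's text as the query
        "defType": "edismax",     # tolerant parser for free text
        "qf": "content",          # field the BM25 score is computed against
        "fl": "id,tags,score",    # "score" is Solr's score pseudo-field
        "rows": rows,
        "wt": "json",
    }
    hits = requests.get(SELECT, params=params).json()["response"]["docs"]
    # take the tags of the best-scoring (nearest) already-tagged document
    return hits[0].get("tags", []) if hits else []

print(suggest_tags("text of the new, untagged document ..."))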

Related

How to disable page boosts when indexing?

Nutch by default enables the scoring-opic plugin. From my understanding, the scoring plugin is responsible for setting the score of each URL in the crawldb. This score is used in two ways:
During the generation of a new segment (fetch list) with -topN, the score determines which URLs will be part of the fetch list (the URLs with the highest scores are included).
During indexing into Solr using the indexer-solr plugin, the score is used to set the boost of the document indexed into Solr.
Please correct me if I am wrong about any of the above.
For my use case:
I want to disable boosts when indexing into Solr.
Also, as I am crawling only a few sites, I do not want links from or to outside each individual site to affect the score. For example, if there is a link from http://siteA.com to http://siteB.com, siteB's score should not be affected, whereas if there is a link from http://siteA.com/first to http://siteA.com/second, I want the score for http://siteA.com/second to increase.
What settings can I tweak to accomplish these two goals?
Regarding your first question, you could remove the boost field from the Solr Index Writer mapping (take a look at https://cwiki.apache.org/confluence/display/nutch/IndexWriters#Mapping_section). This should prevent the field from being sent to Solr.
Regarding the URL scoring for internal/external links, you could try changing the scoring configuration in nutch-site.xml: the db.score.link.internal and db.score.link.external properties both default to 1.

Solr - Find "Significant Terms" on a Subset of Documents

I'm trying to get "significant terms" for a subset of documents in Solr. This may or may not be the best way, but I'm currently attempting to use Solr's TF-IDF functionality since we have the data stored in Solr and it's lightning fast. I want to restrict the "DF" count to a subset of my documents, through a search or a filter. I tried this, where I'm searching for "apple" in the name field:
http://localhost:8983/solr/techproducts/tvrh?q=name:apple&tv.tf=true&tv.df=true&tv.tf_idf=true&indent=on&wt=json&rows=1000
and that, of course, only gives me documents that have "apple" in the name, but the document frequency counts come from the entire dataset, which isn't what I want. I would think Solr can do this, but maybe not. I'm open to suggestions.
Thanks,
Adrian
This is one of the items I have in my backlog [1].
What you need is actually the document frequency in your foreground set (your subset of docs) and the document frequency in your background set (your corpus).
Solr won't do that out of the box, but you can work around it.
Elasticsearch has a module for this that you can take inspiration from [2].
[1] https://issues.apache.org/jira/browse/SOLR-9851
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
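Until then it can be approximated client-side: compare each candidate term's document frequency inside the subset (foreground) with its document frequency over the whole corpus (background). A rough Python sketch with the requests library, assuming a facetable field named "keywords" (the field name and the simple lift ratio are only illustrative, and this is not the same statistic Elasticsearch uses):

import requests

SOLR = "http://localhost:8983/solr/techproducts/select"

def doc_count(query):
    # number of documents matching a query (rows=0 skips fetching them)
    params = {"q": query, "rows": 0, "wt": "json"}
    return requests.get(SOLR, params=params).json()["response"]["numFound"]

def subset_term_counts(subset_query, field="keywords", limit=50):
    # document frequency of the top terms *within the subset*, via faceting
    params = {
        "q": subset_query, "rows": 0, "wt": "json",
        "facet": "true", "facet.field": field,
        "facet.limit": limit, "facet.mincount": 2,
    }
    flat = requests.get(SOLR, params=params).json()["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[::2], flat[1::2]))  # Solr returns [term, count, term, count, ...]

fg_query = "name:apple"
fg_total = doc_count(fg_query)
bg_total = doc_count("*:*")

scored = []
for term, fg_df in subset_term_counts(fg_query).items():
    bg_df = doc_count('keywords:"%s"' % term)        # background DF of this term
    lift = (fg_df / fg_total) / (bg_df / bg_total)   # over-representation in the subset
    scored.append((lift, term, fg_df, bg_df))

for lift, term, fg_df, bg_df in sorted(scored, reverse=True)[:10]:
    print("%s  fg_df=%d  bg_df=%d  lift=%.2f" % (term, fg_df, bg_df, lift))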

Apache Solr or Lucene proximity search on multiple fields

Is it possible in Solr/Lucene to search across different multiValued fields?
Imagine an XML fragment like this:
<normative>
<ref><aut>State</aut><num>70</num><year>2007</year><article>13</article></ref>
<ref><aut>TreasuryMinistry</aut><num>350</num><year>2011</year><article>21</article></ref>
</normative>
Is it possible to retrieve documents containing for instance:
num:70 AND year:2007
inside the same ref ?
i.e. this document should not be found for a query like
num:70 AND year:2011.
I could create concatenated fields like
<ref cat='state-0070-2007-0013'/>
<ref cat='TreasuryMinistry-0350-2011-0021'/>
but the user must be able to search by every combination of fields, i.e.
num and year,
year and article,
num and article,
aut and num and year,
on the same ref!
I am not experienced with Solr/Lucene, so I fear that a wildcard search like
cat:'*-0070-2007-*'
might not perform well over our normative document corpus.
Is there a way to do a search based on relative position?
Something like using copyField to a multiValued field with different positionIncrementGap values?
Not directly answering your proximity question, but can you treat each ref as a document of its own? If so, then a search like num:70 AND year:2007 should work fine, assuming you create the num and year fields.
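A rough sketch of that approach in Python with the requests library (the core name "normative", the parent_id field, and the assumption that the schema already defines aut, num, year and article are all just for illustration):

import requests

UPDATE = "http://localhost:8983/solr/normative/update?commit=true"
SELECT = "http://localhost:8983/solr/normative/select"

# index every <ref> as its own small document, keeping a pointer to its parent
refs = [
    {"id": "doc1-ref1", "parent_id": "doc1", "aut": "State",
     "num": 70, "year": 2007, "article": 13},
    {"id": "doc1-ref2", "parent_id": "doc1", "aut": "TreasuryMinistry",
     "num": 350, "year": 2011, "article": 21},
]
requests.post(UPDATE, json=refs)

# both constraints now have to hold on the same ref document
params = {"q": "num:70 AND year:2007", "fl": "id,parent_id", "wt": "json"}
print(requests.get(SELECT, params=params).json()["response"]["docs"])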

How to sort by tag considering the tag weights related to every document?

I'm building a Solr search engine over a collection of 300k documents. Among the many indexed fields, an important one is tags.
My idea is to assign to every document a vector of tags, each one with a given weight (basically depending on the number of users who chose that tag for that document). For instance:
Doc1 = {tag1:0.3, tag2:0.7, tag3:0.8, tag4:1}
Doc2 = {tag2:0.5, tag3:0.8, tag4:0.8, tag5:0.9}
Using this example, when someone asks for documents tagged with tag4, I would of course return both documents, but Doc1 with a higher score, since tag4 has a higher weight there.
Ideally, the way to implement this in Solr would be to create a multiValued field called "tags" and assign, at indexing time, a weight to each tag contained in that field. So, first question:
Is it possible to assign a term frequency (as a tag weight) manually at indexing time?
From what I found... it seems not! OK... a workaround is to copy, for instance, tag4 ten times into the tags field of Doc1 and just eight times into the tags field of Doc2, which of course has some drawbacks and limitations.
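A small sketch of that repetition workaround in Python (the scale factor of 10 is just an assumption for illustration; this is not a Solr feature, only a way of faking weights through term frequency):

def weighted_tags_field(weights, scale=10):
    # repeat each tag proportionally to its weight (at least once)
    field = []
    for tag, weight in weights.items():
        field.extend([tag] * max(1, round(weight * scale)))
    return field

doc1 = {"id": "Doc1", "tags": weighted_tags_field(
    {"tag1": 0.3, "tag2": 0.7, "tag3": 0.8, "tag4": 1.0})}
doc2 = {"id": "Doc2", "tags": weighted_tags_field(
    {"tag2": 0.5, "tag3": 0.8, "tag4": 0.8, "tag5": 0.9})}

# doc1["tags"] contains tag4 ten times and doc2["tags"] eight times, so
# tf(tags, tag4) ranks Doc1 above Doc2 on a tag4 query
print(doc1["tags"].count("tag4"), doc2["tags"].count("tag4"))  # 10 8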
However, here comes the bigger problem, which I cannot solve even with a workaround. I would like to define my own score. The one that fits my specific case best would be something like sort=tf(tags,tag4). In fact, TF is in this case much more important than IDF! Unfortunately this feature (relevance functions) will only be released in Solr 4: http://wiki.apache.org/solr/FunctionQuery#tf
Do you have any idea how to change the scoring function in Solr 3.5 to give more importance to TF and less to IDF?
Is there any hack to do it simply, or would you change the Lucene source code (if so, what and where?), or would you use a Solr 4 nightly build?
Thanks in advance for your advice!

Is it possible to have SOLR MoreLikeThis use different fields for model and matches?

Let's say I have documents with two fields, A and B.
I'd like to use SOLR's MoreLikeThis, but with a twist: I'm most interested in boosting documents whose A field is like my model document's B field. (That is, extract MLT's 'interesting terms' from the model B field, but only collect MLT results based on the A field.)
I don't see a way to use the mlt.fl fields or mlt.qf boosts to achieve this effect in a single query. (It seems mlt.fl specifies fields used for both discovery of 'interesting terms' and matching to those terms.) Am I missing some option?
Or will I have to extract the 'interesting terms' myself and swap the 'field:term' details?
(Other ideas in this same vein appreciated as well.)
Two options I see are:
Use a copyField - index your original document with a copy of field A named B, and then query using B.
Extend MoreLikeThisHandler and change the fields you query.
The first option costs a bit of work (mostly configuration changes) and some memory consumption. The second involves more programming but no memory footprint increase. Hope one of them suits your needs.
I now think there are two ways to achieve the desired effect (without customizing the MLT source code).
First option: Do an initial MLT query with the MLT handler, adding the parameter &mlt.interestingTerms=details. This includes the list of terms that were deemed interesting, ranked with their relative boosts. The usual behavior uses those discovered terms against the same mlt.fl fields to find similar documents. For example, the response will include something like:
"interestingTerms":
["field_b:foo",5.0,"field_b:bar",2.9085307,"field_b:baz",1.67070794]
(Since the only thing about this initial query that's interesting is the interestingTerms, throwing in an fq that rules out all docs could help it skip unnecessary scoring work.)
Explicitly re-composing that interestingTerms info into a new OR query, field_a:foo^5.0 field_a:bar^2.9085307 field_a:baz^1.67070794, amounts to using the B field's example text to find documents that are similar in field A, and probably mimics exactly the kind of query default MLT does on its usual model field.
Second option: Grab the model document's actual field B text, and feed it directly as a ContentStream body, to be used in lieu of a query, for specifying the model document. Then target mlt.fl at field A for the sake of collecting similar results. For example, a fragment of the parameters might be …&stream.body=foo bar baz&mlt.fl=field_a&…. Again, the net effect being that model text originally from field_b is finding documents similar only in field_a.
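Roughly, the first option in Python with the requests library (the core name "articles", the document id "doc42", and the assumption that MoreLikeThisHandler is registered at /mlt are all just for illustration):

import requests

BASE = "http://localhost:8983/solr/articles"

# Step 1: ask MLT only for the interesting terms it extracts from field_b
mlt = requests.get(BASE + "/mlt", params={
    "q": "id:doc42",
    "mlt.fl": "field_b",
    "mlt.interestingTerms": "details",
    "wt": "json",
}).json()
flat = mlt["interestingTerms"]            # ["field_b:foo", 5.0, "field_b:bar", 2.9, ...]
pairs = list(zip(flat[::2], flat[1::2]))

# Step 2: re-target the same terms and boosts at field_a in an ordinary query
q = " ".join("field_a:%s^%s" % (term.split(":", 1)[1], boost)
             for term, boost in pairs)
similar = requests.get(BASE + "/select", params={
    "q": q,
    "q.op": "OR",
    "fq": "-id:doc42",                    # exclude the model document itself
    "wt": "json",
}).json()["response"]["docs"]
print([doc["id"] for doc in similar])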
