Searching wiki URLs using Solr

I am trying to index and search a wiki on our intranet using Solr. I have it more-or-less working using edismax but I'm having trouble getting main topic pages to show up first in the search results. For example, suppose I have some URLs in the database:
http://whizbang.com/wiki/Foo/Bar
http://whizbang.com/wiki/Foo/Bar/One
http://whizbang.com/wiki/Foo/Bar/Two
http://whizbang.com/wiki/Foo/Bar/Two/Two_point_one
I would like to be able to search for "foo bar" and have the first link returned as the top result because it is the main page for that particular topic in the wiki. I've tried boosting the title and URL field in the search but the fieldNorm value for the document keeps affecting the scores such that sub-pages score higher. In one particular case, the main topic page shows up on the 2nd results page.
Is there a way to make the first URL score significantly higher than its sub-pages so that it shows up in the top five search results?

One possible approach to try:
1) Create a copyField from your URL field.
2) Extract the path only (so no host and no "wiki" prefix).
3) Split on / and maybe on spaces.
4) Lowercase.
5) Boost on a phrase or bigram match, or something similar.
If you have many levels, consider a multivalued field in which each depth (counting from the end of the path) gets its own entry. That way an exact match will score higher. From there, start experimenting with your real searches.
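The multivalued idea can be sketched outside Solr as a small preprocessing step that runs before indexing. This is only an illustration: the function name `path_suffixes`, the `strip_prefix` argument, and the choice to turn underscores into spaces are assumptions, not something the answer specifies.

```python
from urllib.parse import urlparse, unquote


def path_suffixes(url, strip_prefix="wiki"):
    """Return lowercased path suffixes, deepest segment first.

    Each returned string would become one value of a multivalued field.
    """
    # Keep only the path, split on "/", drop empty segments, lowercase.
    # Treating "_" as a space is an assumption about wiki-style URLs.
    segments = [
        unquote(part).replace("_", " ").lower()
        for part in urlparse(url).path.split("/")
        if part
    ]
    # Drop the leading "wiki" segment so that only the topic path remains.
    if segments and segments[0] == strip_prefix:
        segments = segments[1:]
    # Build one entry per depth, starting from the end of the path.
    return [" ".join(segments[i:]) for i in range(len(segments) - 1, -1, -1)]


main_page = path_suffixes("http://whizbang.com/wiki/Foo/Bar")
sub_page = path_suffixes("http://whizbang.com/wiki/Foo/Bar/Two/Two_point_one")
```

With these values, only the main page has an entry that is exactly "foo bar", which is what lets a phrase boost on that field favor it over the sub-pages.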

Related

How to disable page boosts when indexing?

Nutch by default enables the scoring-opic plugin. From my understanding, the scoring plugin is responsible for setting the score of each url in the crawldb. This score will be used in two ways:
During the generation of a new segment (fetch list) with -topN, the score determines which urls will be part of the fetch list (those urls with the highest scores will be part of the fetch list).
During indexing into Solr using the indexer-solr plugin, the score will be used to set the boost of the document indexed into Solr.
Please correct me if I am wrong about any of the above.
For my use case:
I want to disable boosts when indexing into Solr.
I am crawling only a few URLs, and I do not want links between different sites to affect the score. For example, if there is a link from http://siteA.com to http://siteB.com, siteB's score should not be affected. Whereas if there is a link from http://siteA.com/first to http://siteA.com/second, I want the score for http://siteA.com/second to increase.
What setting can I tweak to accomplish these two goals?
Regarding your first question, you could remove the boost field from the Solr Index Writer mapping (take a look at https://cwiki.apache.org/confluence/display/nutch/IndexWriters#Mapping_section). This should avoid sending the field to Solr.
Regarding URL scoring for internal and external links, you could try changing the scoring configuration in the nutch-site.xml file. By default, the score factors for both internal and external links (db.score.link.internal and db.score.link.external) are set to 1.

Elements getting added in Solr index but not able to search elements as desired

I'm working with Solr to store web crawling results to be used in a search engine. The structure of my documents in Solr is the following:
{
word: The word received after tokenizing the body obtained from the html.
url: The url where this word was found.
frequency: The no. of times the word was found in the url.
}
When I go to the Solr dashboard on my system, which is http://localhost:8983/solr/#/CrawlerSearchResults/query, I'm able to find a word, say "Amazon", with the query "word: Amazon", but when I search for Amazon directly I get no results. Could you please help me out with this issue?
Image links below.
First case
Second case (No results)
Thanks,
Nilesh.
In your second example, the value is searched against the default search field (since you haven't provided a field name). This is by default a field named _text_.
To support just typing a query into the q parameter without field names, you can either set the default field to search in with df=word in your URL, or use the edismax query parser (defType=edismax) and the qf parameter (query fields). qf accepts multiple fields and lets you give each a weight, but in your case it'd just be qf=word.
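As a rough sketch, the two variants could be built as request URLs like this. The core name comes from the dashboard URL in the question; the `/select` handler path and the helper names are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed select endpoint for the core named in the question.
BASE = "http://localhost:8983/solr/CrawlerSearchResults/select"


def with_default_field(text, field="word"):
    """Query with the standard parser, using df as the default search field."""
    return BASE + "?" + urlencode({"q": text, "df": field})


def with_edismax(text, fields=("word",)):
    """Query with edismax, searching the fields listed in qf."""
    params = {"q": text, "defType": "edismax", "qf": " ".join(fields)}
    return BASE + "?" + urlencode(params)


df_url = with_default_field("Amazon")
edismax_url = with_edismax("Amazon")
```

Either URL lets the user type just Amazon instead of word:Amazon; the edismax form is the one to extend if more fields are added later.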
Second - what you're doing seems to replicate what Lucene is doing internally, so I'm not sure why you'd do it this way (each word is what's called a "token", and each count is what's called a term frequency). You can write a custom similarity to add custom scoring based on these parameters.

Drupal Job board: Faceted Search with "OR" operator, but sort results by most matching facet criteria/term count

I'm quite stuck with searching for a solution for my problem and I hope that you can maybe help me.
In general I want to build a small job platform. It includes an "Explore"-Section, which is just like a Search-Page with Facets.
The actual job nodes can be tagged with terms from the two vocabularies "skills" and "interests".
The facets on the search page allow the user to filter jobs by exactly these skills and interests.
However, I want to use the "OR" operator for the facets, so that the user gets a list of jobs that match their skills and interests almost perfectly, but also jobs that match only some of these terms.
So, here you can see the default listing page. On the left are the Facets for interest and type (Operator "OR"). On the right, you can see the result set with title, and the node's skills & interest terms:
See the image of the Jobsearch Default page
Now, I'm applying "Musik" and "Kultur" as interest-filters:
See the image of the Jobsearch with applied filters
As you can see in the result-set, the OR-operator delivers all the results.
However, I would like to sort these results by their "relevance", that is, by the number of matched criteria.
The 4th and 5th results match both terms selected in the facet, so they should be listed ahead of all other results.
So, I hope you understand what I want to achieve. I started with Views to accomplish the goal, but then switched to search_api and Solr, as I think this approach will be easier to extend in the future.
The second aim is, that a user can store his/her individual interests & skills (the filters mentioned before) in his user profile. Here, the user should see individual job recommendations based on his profile on his account-page.
So, any hints, tips, tricks, links are very welcome as I have no idea if I'm on the right track to solve my problem(s). :)
Robert
Maybe this approach could be an alternative:
Instead of using the tags as facets/filters, I could use them just as search input.
When I type my terms/tags into the search field of an Apache Solr search page, I get exactly the results I want, sorted by relevance:
Searching the tags instead of filtering
So maybe I just need a small piece of code that automatically creates a search query based on the clicked terms/tags…
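A minimal sketch of that "small piece of code" might look like the following. The field name `tags` and the function name are hypothetical; the real field name depends on how search_api maps the term reference fields into the index.

```python
def tags_to_query(selected_tags, field="tags"):
    """Build an OR query over the selected tags.

    With default scoring, documents matching more of the OR clauses
    typically rank higher, which is the ordering asked for above.
    """
    if not selected_tags:
        return "*:*"  # no tags selected: match everything
    # Quote each tag and escape backslashes and embedded quotes.
    quoted = [
        '"{}"'.format(tag.replace("\\", "\\\\").replace('"', '\\"'))
        for tag in selected_tags
    ]
    return "{}:({})".format(field, " OR ".join(quoted))


query = tags_to_query(["Musik", "Kultur"])
```

The resulting string would be passed as the q parameter rather than as a filter, so that the tags contribute to the relevance score instead of only restricting the result set.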

How do I override Solr's relevancy in a query

I am integrating a chemical structure search with Solr. To that end I am creating a Solr plugin.
The structure search returns the structure_id and its score. Scores are values between 100 and 0 (we would probably never see a 0).
I use this to create a Solr query to pull all documents that have the structure_ids. I want the results of the search to be ordered by the structure search score, not the Solr relevancy.
I generate a query that looks like this:
+structure_id:(28760263^95 OR 30392284^82 OR 47390042^70)
The problem is that in my trivial test case Solr is returning the records matching the structure_id 28760263 last. It has assigned it the lowest relevancy (4.6609402E-6)!
I wrote a function that amplifies the score by a large factor, and that apparently does fix the problem. However, I don't think the amplification should be necessary.
I am using Solr 3.5.
Is there some configuration that I am missing? Currently I am using Solr pretty much out of the box. The only changes I've made are adding my plugin and editing the example docs to add structure_ids for my test case.
Is there a way to completely override the Lucene scoring with the score from the structure search? We have other reasons why we would like to take control of Solr's scoring, and knowing how to do that would be useful.

Solr - How do I get the number of documents for each field containing the search term within that field in Solr?

Imagine an index like the following:
id   partno     name           description
1    1000.001   Apple iPod     iPod by Apple
2    1000.123   Apple iPhone   The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the UI after the first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try the multiple-requests solution and benchmark it. Solr tends to be a lot faster than what people put in front of it, so a couple of extra requests might not be that big of a deal.
You could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You could also implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain, so you will only need a single query in the frontend.
If you want the term to be searched over the same fields every time, you have 2 options that don't break the "single query" requirement:
1) copyField: at index time, group all the fields that should match together. With just one copyField your problem doesn't exist; if you need more than one, you're back in the same spot.
2) You could filter the query each time by dynamically adding the "fq" parameter at the end:
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
This works if you always search on the same two fields (or can set them up before querying); otherwise you'll always need at least a second query.
Since I said "you have 2 options" but you actually have 3 (and I rushed my answer), here's the third:
3) The dismax plugin, which its documentation describes like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
So, if you can use it, you may want to give it a look and start from the qf parameter (that is what option number 2 was meant to be about, but I changed it in favor of fq... don't ask me why...).
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
  "responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
  "response":{"numFound":3365840,"start":0,"docs":[]},
  "facet_counts":{
    "facet_queries":{
      "title:donkey":127,
      "text:donkey":4108
    },
    "facet_fields":{},
    "facet_dates":{},
    "facet_ranges":{}
  }
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
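To feed a "Filter by field" section from such a response, the request parameters and the counts could be handled roughly as follows. This is a sketch under assumptions: the helper names are made up, the search term is assumed to be a single word that needs no query escaping, and the sample response is trimmed from the one shown above.

```python
import json
from urllib.parse import urlencode


def field_count_params(term, fields, rows=10):
    """Build the query string: one edismax query plus one facet.query per field."""
    params = [
        ("q", term),
        ("defType", "edismax"),
        ("qf", " ".join(fields)),
        ("rows", str(rows)),
        ("facet", "true"),
        ("wt", "json"),
    ]
    # A multi-word or special-character term would need quoting/escaping here.
    params += [("facet.query", "{}:{}".format(f, term)) for f in fields]
    return urlencode(params)


def hits_per_field(response_text):
    """Map each field name to its hit count from the facet_queries section."""
    facet_queries = json.loads(response_text)["facet_counts"]["facet_queries"]
    return {key.split(":", 1)[0]: count for key, count in facet_queries.items()}


sample = '{"facet_counts":{"facet_queries":{"title:donkey":127,"text:donkey":4108}}}'
query_string = field_count_params("donkey", ["text", "title"])
counts = hits_per_field(sample)
```

Fields whose count is zero can simply be left out of the filter list, so the UI only offers fields that actually contain the term.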

Resources