I am indexing a set of documents with a field called 'route_field', which has also been set as the router field. My collection has 10 shards (e.g. shard_1..shard_10). Currently I am unable to route documents to the desired shard by populating router_field: e.g. if router_field is set to 'shard_1', the document should be stored in shard_1. I have followed the documentation, but the routing does not seem to take place and Solr is choosing shards on its own.
Nutch enables the scoring-opic plugin by default. From my understanding, the scoring plugin is responsible for setting the score of each URL in the crawldb. This score is used in two ways:
During the generation of a new segment (fetch list) with -topN, the score determines which URLs become part of the fetch list (the URLs with the highest scores are selected).
During indexing into Solr using the indexer-solr plugin, the score is used to set the boost of the document indexed into Solr.
Please correct me if I am wrong about any of the above.
For my use case:
I want to disable boosts when indexing into Solr.
I am crawling only a few URLs, and I do not want links between different sites to affect the score. For example, if there is a link from http://siteA.com to http://siteB.com, siteB's score should not be affected. Whereas if there is a link from http://siteA.com/first to http://siteA.com/second, I want the score for http://siteA.com/second to increase.
Which settings can I tweak to accomplish these two goals?
Regarding your first question, you could remove the boost field from the Solr index writer mapping (take a look at https://cwiki.apache.org/confluence/display/nutch/IndexWriters#Mapping_section). This should prevent the field from being sent to Solr.
Regarding URL scoring for internal/external links, you could try changing the scoring configuration in the nutch-site.xml file. By default, the score factors for both internal and external links (db.score.link.internal and db.score.link.external) are set to 1.
I want to manipulate a document and change the token values of some field(s) by prepending a value to each token. I am doing bulk updates through DIH and also posting documents through SolrJ. I have a replication factor of 2, so replication should also keep working. The value that I want to prepend is present in the document as a separate field. I would like to know where I can intercept the document before indexing so that I can manipulate it. One option I can think of is overriding DirectUpdateHandler2. Is this the right place?
I could do it by processing the document externally and then passing it to Solr, but I want to do it inside Solr.
Document fields are:
city:mumbai
RestaurantName:Talk About
Keywords:Cofee, Chines, South Indian, Bar
I want to index keywords as
mumbai_cofee
mumbai_Chines
mumbai_South Indian
mumbai_Bar
The right place is an Update Request Processor (URP). Make sure you plug it in solrconfig.xml into all the update handlers you are using (including DIH), and the single URP will cover all updates.
In your Java code in the URP you can easily get the value of one field and prepend it to the values of another field, and so on. This happens before the document is indexed.
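The transformation itself is small. Below is a minimal sketch of the logic only, written in plain Python rather than as a real Solr URP (in Solr this would live in the Java processor's add-document hook). The function name prefix_keywords, the dict-based document, and the "_" separator are assumptions made for illustration; the field names come from the question.

```python
def prefix_keywords(doc, prefix_field="city", target_field="Keywords", sep="_"):
    """Return a copy of doc whose target_field values are prefixed
    with the value of prefix_field (hypothetical helper, not a Solr API)."""
    prefix = doc.get(prefix_field)
    values = doc.get(target_field)
    if prefix is None or values is None:
        # Nothing to do if either field is missing.
        return dict(doc)
    if isinstance(values, str):
        # Assumption: a single string holds comma-separated keywords.
        values = [v.strip() for v in values.split(",")]
    out = dict(doc)
    out[target_field] = [f"{prefix}{sep}{v}" for v in values if v]
    return out


doc = {
    "city": "mumbai",
    "RestaurantName": "Talk About",
    "Keywords": "Cofee, Chines, South Indian, Bar",
}
result = prefix_keywords(doc)
print(result["Keywords"])
```

The sketch keeps the original case of each keyword and leaves the input document untouched; whether to lowercase or to write into a separate field is a design choice for the real processor.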
I am indexing documents into Solr from a source. At the source, each document has some associated properties, which I am indexing into and fetching from Solr.
I am mapping some of the source properties to Solr schema fields. But I can see a couple of extra fields in the Solr logs which I am not mapping. When querying in the Solr admin UI, I can see only the mapped fields.
E.g. in the log below, I am using only content_name and content_modifier, but I can see the Template field as well.
INFO - 2014-09-18 12:07:47.185; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update/extract params={literal.content_name=1_.000&literal.content_modifier=System&literal.Template={8ad4d8f0-93a7-4941-9657-cf3706f00409} {add=[1_.000 (1479581071766978560)]} 0 0
So what is happening here? Will Solr index only the mapped fields and skip the unmapped ones? Or will Solr index all fields, mapped and unmapped, but show only the mapped fields in the admin UI?
Please suggest.
The answer depends on what your solrconfig and schema say, because you can configure this any way you want. Here is how it works for the example schema in Solr 4.10:
1) In solrconfig.xml, the handler uses the "uprefix" parameter to map all fields NOT in the schema to a dynamic field ignored_*
2) In schema.xml, that dynamic field has type ignored
3) Type ignored (in the same file) is defined with stored=false and indexed=false. This means: do not complain if you receive a field matching the pattern, but do nothing with it; literally ignore it.
So, if you don't like that, you can modify any part of that pipeline. The easiest test would be to change the dynamic field to use type string and reindex. Then, you should see the rest of the fields.
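As a rough illustration of the three steps above, here is a small Python simulation of the renaming-and-dropping behaviour. It is not Solr code: the function route_fields and the schema_fields set are hypothetical, and the assumption is that only content_name and content_modifier are declared in the schema, as the question implies.

```python
def route_fields(incoming, schema_fields, uprefix="ignored_"):
    """Split incoming fields into those kept as-is and those renamed
    with the uprefix (simulation only; names are hypothetical)."""
    kept, ignored = {}, {}
    for name, value in incoming.items():
        if name in schema_fields:
            kept[name] = value               # declared in the schema
        else:
            ignored[uprefix + name] = value  # matches dynamic field ignored_*
    return kept, ignored


# Field values taken from the log line in the question.
incoming = {
    "content_name": "1_.000",
    "content_modifier": "System",
    "Template": "{8ad4d8f0-93a7-4941-9657-cf3706f00409}",
}
schema_fields = {"content_name", "content_modifier"}  # assumed schema

kept, ignored = route_fields(incoming, schema_fields)
print(sorted(kept))   # fields that are indexed/stored normally
print(list(ignored))  # fields that land in the ignored type and are discarded
```

With the ignored type being neither indexed nor stored, everything in the second group disappears, which would explain why the field shows up in the log but not in query results.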
After executing a field query for the given search term(s), the Solr APIs return the document IDs.
Is there a way to fetch a minimal set of fields, containing only the fields that matched the end user's search terms?
For example, I have a Solr document with nearly 200 attributes.
My query is (name:SOLR* OR Description:LUCENE*) AND (Publisher:Print* OR AUTHOR:ERIC etc)
In the above example, if the name field matches, I want only name to be returned, and so on.
I have thousands of documents indexed in Solr, which represent data crawled from different websites. One of the fields of a document is SourceURL, which contains the URL of the webpage that I crawled and indexed into this document.
I want to boost results from a specific website using a boost query.
For example, I have 4 documents whose SourceURL fields contain the following values:
https://meta.stackoverflow.com/page1
http://www.stackoverflow.com/page2
https://stackoverflow.com/page3
https://stackexchange.com/page1
I want to boost all results that are from stackoverflow.com but not from its subdomains (in this case, results 2 and 3).
Do you know how I can index the URL field and then use a boost query to identify all the documents from a specific website, as in the case above?
One way would be to parse the URL prior to index time and record whether it belongs to the primary domain (for example, in a boolean primarydomain field in your schema.xml file).
Then you can boost on the primarydomain field at query time. See the DisMaxQParserPlugin page on the Solr Wiki for an example of how to boost fields at query time.
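A minimal sketch of the pre-indexing step, assuming the crawler or indexing client is free to compute the flag before sending the document. The helper name is_primary_domain is hypothetical, and treating the "www." host as the primary domain is an assumption based on the asker wanting results 2 and 3 boosted.

```python
from urllib.parse import urlparse


def is_primary_domain(url, domain):
    """True if the URL's host is the domain itself or its www. alias
    (hypothetical helper; other subdomains count as non-primary)."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host == "www." + domain


urls = [
    "https://meta.stackoverflow.com/page1",
    "http://www.stackoverflow.com/page2",
    "https://stackoverflow.com/page3",
    "https://stackexchange.com/page1",
]

# Value to store in the boolean primarydomain field for each document.
flags = [is_primary_domain(u, "stackoverflow.com") for u in urls]
print(flags)
```

With the flag indexed, a DisMax boost query along the lines of bq=primarydomain:true^10 could then favour those documents; the boost factor 10 is an arbitrary placeholder to tune.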