Integrate Nutch with Solr for Advanced Search Options

I am using apache-nutch-1.4 with apache-solr-3.2.0.
I have successfully integrated Nutch with Solr.
When I run the following query:
mysite/solr/select/?q=bone&version=2.2&start=0&rows=10&indent=on
it gives me the following result:
<doc>
  <float name="boost">1.0117649</float>
  <str name="cache">content</str>
  <str name="content"></str>
  <str name="digest">9bf016ea547cf50be81e468553c483de</str>
  <str name="id">http://107.21.107.118:8000/</str>
  <str name="segment">20120214151903</str>
  <str name="title">Home</str>
  <date name="tstamp">2012-02-14T10:19:08.215Z</date>
  <str name="url">mysite:8000/</str>
</doc>
The problem is searching for bone within a particular category, such as Cancer or Colorectal & Digestive.
Which parameter do I need to add to the query above to get records for that specific category only?
mysite:8983/solr/select/?q=bone&????????
I have URLs like:
mysite:8000/Encyclopedia/Patient Centers/
mysite:8000/Encyclopedia/Patient Centers/Cancer/
mysite:8000/Encyclopedia/Patient Centers/Cancer/Colorectal & Digestive/
My schema.xml, which I have also added to the Nutch directory, looks like this:
http://dpaste.org/MTDF2/
My reputation is below 10, so I cannot attach files here; that is why I pasted schema.xml on dpaste.org.
Sorry for any inconvenience.
I would really appreciate your advice and suggestions.

First you have to store Cancer and Colorectal & Digestive in a category field. You can use http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PathHierarchyTokenizerFactory for that. Then the URLs could look like mysite:8983/solr/select/?q=bone&fq=category:Cancer
See http://wiki.apache.org/solr/CommonQueryParameters#fq for the fq parameter.
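A category name such as Colorectal & Digestive contains a space and an ampersand, so it needs quoting inside the filter and percent-encoding in the URL. The following is only a minimal sketch: it assumes a `category` field that stores the bare category name, and `SOLR_SELECT` and `category_search_url` are hypothetical names. If the field is analyzed with PathHierarchyTokenizerFactory, the indexed tokens are likely to be path prefixes, in which case the filter value would have to be the path rather than the bare name.

```python
from urllib.parse import urlencode

# Hypothetical base URL; "mysite:8983" is the placeholder host from the question.
SOLR_SELECT = "http://mysite:8983/solr/select/"


def category_search_url(term, category):
    """Build a select URL that searches for `term` within one category."""
    # Quote the value so spaces and '&' stay part of a single filter term.
    quoted = '"{}"'.format(category.replace('"', '\\"'))
    params = [
        ("q", term),
        ("fq", "category:" + quoted),  # assumes a field named "category"
    ]
    # urlencode percent-encodes ':', '"' and '&' and turns spaces into '+'.
    return SOLR_SELECT + "?" + urlencode(params)


url = category_search_url("bone", "Colorectal & Digestive")
print(url)
```

Without the encoding step, the raw ampersand would end the fq parameter early and Solr would only see category:"Colorectal .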

Related

How to highlight multiple words using different formatters in Solr?

I need to highlight multiple words in the same field, but with a specific formatter (prefix and postfix) for each one.
Let's say that I have the description field and for a document it has the value: Einstein always excelled at math and physics from a young age. How can I highlight math with one specific prefix-postfix pair AND ALSO physics with a different prefix-postfix pair? So, in the end I would like to obtain:
Einstein always excelled at <em class="hl-red">math</em> and <em class="hl-green">physics</em> from a young age
The reason is that in the frontend I have different CSS classes, for example background-color: red; for hl-red and background-color: green; for hl-green.
So far I have managed to highlight multiple words in the same field, but with the same prefix-postfix pair everywhere, which is not what I want. In addition, I tried to add multiple HtmlFormatter entries in solrconfig.xml:
<highlighting>
..............
  <formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
    <lst name="defaults">
      <str name="hl.simple.pre"><![CDATA[<em>]]></str>
      <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
    <lst name="hl-red">
      <str name="hl.simple.pre"><![CDATA[<em class="hl-red">]]></str>
      <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
    <lst name="hl-green">
      <str name="hl.simple.pre"><![CDATA[<em class="hl-green">]]></str>
      <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
  </formatter>
..............
</highlighting>
but I got HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Unknown formatter: hl-green. Also, I didn't find a way to specify an array of prefixes in Solr Admin UI nor in spring-data-solr, just a simple query like this:
SimpleHighlightQuery query = new SimpleHighlightQuery(Objects.requireNonNull(criteria));
HighlightOptions highlightOptions = new HighlightOptions()
        .addFields(fields)
        .setSimplePrefix(prefix)
        .setSimplePostfix(postfix);
query.setHighlightOptions(highlightOptions);
query.setPageRequest(pageable);
return solrTemplate.queryForHighlightPage(MY_CORE, query, MyModel.class);
My assumption is that this is a limitation of Solr itself.
I was thinking about writing a custom fragmentsBuilder, but I do not know whether that is the right approach or how to do it. As another workaround, I was thinking of executing one highlight query per word: run the query for the first word, store the result, run another highlight query for the second word, store the result, and so on. But I don't think that is a good or elegant solution, because the second query will cause problems if the second word is "em", "class", or "red"/"green" (undesired nested highlighting will occur).
I am using spring-data-solr in a Spring Boot application and Solr 6.6.5 as an (HTTP) service.
Does anyone know how to solve this? Please give me some advice! Any idea will be much appreciated!
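One possible client-side workaround, which is not a Solr feature and not something confirmed in this thread: let Solr highlight every word with a single neutral marker pair, then rewrite each marked span according to the term it contains. The sketch below assumes the neutral pair is <em> and </em>; `TERM_CLASSES` and `recolor` are hypothetical names. It matches on the exact highlighted text, so stemmed or otherwise analyzed variants would fall back to the default class.

```python
import re

# Hypothetical mapping from highlighted term to CSS class (classes from the question).
TERM_CLASSES = {"math": "hl-red", "physics": "hl-green"}

# Matches the single neutral marker pair requested via hl.simple.pre / hl.simple.post.
MARKED = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def recolor(snippet, term_classes, default_class="hl"):
    """Replace each neutral <em>...</em> span with a term-specific class."""
    def swap(match):
        text = match.group(1)
        css = term_classes.get(text.lower(), default_class)
        return '<em class="{}">{}</em>'.format(css, text)

    return MARKED.sub(swap, snippet)


snippet = "Einstein always excelled at <em>math</em> and <em>physics</em> from a young age"
print(recolor(snippet, TERM_CLASSES))
```

Because only one highlight query runs, the nested-highlighting problem of the query-per-word idea does not arise. If the field content can itself contain <em>, a marker pair that cannot occur in the data would be a safer choice.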

Solr adds unwanted MatchAllDocsQuery and I don't know why

In my company we have a test string, which we use to ensure escaping issues are handled correctly throughout our many components:
!"§$%&/()?ß><öä€ü\ÖÄÄÜ#'
When I add a document to Solr with that title, all is well.
I now try to query that document using the same string, but with all special query characters escaped (see here for details):
!\"§$%&\/\(\)\?ß><öä€ü\\ÖÄÄÜ#'
Surprisingly, all documents in my index match that query!
I can see in the debug output (see below), that Solr adds a MatchAllDocsQuery after my actual query. That's why all documents match, but the big question is:
Why does Solr add that match-all query? It doesn't make any sense to me.
Funnily enough, when I remove one of the escaping backslashes (e.g. the very first one before the double-quote), the query works like a charm and only finds my one expected document. For whatever reason, Solr then does not add that match_all query anymore.
!"§$%&\/\(\)\?ß><öä€ü\\ÖÄÄÜ#'
Any ideas???
Debug info:
"rawquerystring": "!\\\"§$%&\\/\\(\\)\\?ß><öä€ü\\\\ÖÄÄÜ#'",
"querystring": "!\\\"§$%&\\/\\(\\)\\?ß><öä€ü\\\\ÖÄÄÜ#'",
"parsedquery": "(+(-DisjunctionMaxQuery((((de_all:ss de_all:oa de_all:u de_all:oaau)~4) | ((en_all:ß en_all:öä en_all:ü en_all:öääü)~4) | string_all:\"§$%&/()?ss><oa€u\\oaau#')) +MatchAllDocsQuery(*:*)))/no_coord",
"parsedquery_toString": "+(-(((de_all:ss de_all:oa de_all:u de_all:oaau)~4) | ((en_all:ß en_all:öä en_all:ü en_all:öääü)~4) | string_all:\"§$%&/()?ss><oa€u\\oaau#') +*:*)"
Request handler:
<requestHandler name="/custom" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <str name="wt">json</str>
    <str name="defType">edismax</str>
    <str name="qf">de_all^1 en_all^1 string_all^1</str>
    <str name="fl">id,score</str>
    <str name="indent">false</str>
  </lst>
</requestHandler>
If you need any other info, please let me know!
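Ahh, stupid mistake: I forgot to escape the leading '!'. Unescaped, it acts as the NOT operator, which turns the whole input into a single negated clause. AFAIK Solr handles such purely negative queries internally by adding a match-all query (*:*) for the negation to subtract from, which is the MatchAllDocsQuery in the debug output.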
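To avoid missing a character such as the leading '!', the escaping can be done by a helper rather than by hand. This is a minimal sketch, assuming the special-character set of the Lucene query syntax plus single & and |, which are harmless to escape; `SPECIAL_CHARS` and `escape_query` are hypothetical names. SolrJ ships ClientUtils.escapeQueryChars for the same purpose.

```python
# Characters the Lucene/Solr query syntax treats as special (assumed set).
SPECIAL_CHARS = set('\\+-!(){}[]^"~*?:/&|')


def escape_query(text):
    """Prefix every special query character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


# The test string from the question, written as a Python literal.
test_string = '!"§$%&/()?ß><öä€ü\\ÖÄÄÜ#\''
escaped = escape_query(test_string)
print(escaped)
```

The escaped string starts with \! so the parser treats the exclamation mark as a literal character instead of a negation.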

Solr : Make XML as response in Solr 4.8.1

I am using Solr 4.8.1.
When I execute any query for testing purposes from the dashboard, I get the response in JSON (by default).
Can I change it and make XML the default?
Please refer to the screenshot below.
I am talking about the dashboard only.
Thanks for looking. :)
The default values for your requestHandlers (which are what respond when a query is sent to /query or /select etc.) are set in solrconfig.xml. Here's the example from example/solr in the distribution:
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
  </lst>
</requestHandler>
Changing wt to xml will give you a requestHandler that returns its response as XML by default, unless overridden at query time with the wt parameter. There might be parts of the web interface that assume the response will be JSON, but I'm pretty sure those supply a value for wt anyway.
I don't know if there is an administration setting for the web UI defaults, but you can change the HTML easily.
In
solr-4.8.1\example\solr-webapp\webapp\tpl\query.html
change the order of the options:
<select name="wt" id="wt" title="The writer type (response format).">
  <option>xml</option>
  <option>json</option>
  <option>python</option>
  <option>ruby</option>
  <option>php</option>
  <option>csv</option>
</select>
Whichever option you put first will be the default, or you can mark one as selected:
<option selected="selected">
You may also change this HTML in the war file in solr-4.8.1\example\webapps.
Note that the paths are relative to the example directory of the 4.8.1 release.

OR query in Solr not working

My Solr server seems to work only with AND but not with OR, e.g.
/solr/select?q=(marsden AND emma)&qt=yodl_handler works, but
/solr/select?q=(marsden OR mackey)&qt=yodl_handler doesn't.
Each individual query returns results, e.g.
/solr/select?q=marsden&qt=yodl_handler returns 2 results
/solr/select?q=mackey&qt=yodl_handler returns 3 results
Any suggestions are appreciated!
Here is the definition of yodl_handler:
<requestHandler name="yodl_handler" class="solr.DisMaxRequestHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
dc.title^1 dc3.title^1 dc.type^1 vra2.logical.creator^1 vra2.image.agent.name^1 vra2.image.agent.role^1 vra2.image.agent.role^1 vra2.image.location.name^1 vra2.image.rights.rightsholder^1 vra2.image.source.refid^1 vra2.image.title^1 vra2.image.worktype^1 vra2.work.agent.attribution^1 vra2.work.agent.name^1 vra2.work.agent.role^1 vra2.work.culturalContext^1 vra2.work.description^1 vra2.work.location-type^1 vra2.logical.location.place^1 vra2.logical.location.name^1 vra2.work.location.refid^1 vra2.work.material^1 vra2.work.rights.rightsHolder^1 vra2.work.StylePeriod^1 vra2.work.subject.term.name^1 vra2.work.subject.term.place^1 vra2.work.subject.term.keyword^1 vra2.work.technique^1 vra2.work.title^1 vra2.work.worktype^1 iris2.instrument.instrumentType^1 iris2.instrument.primaryInstrumentType^1 iris2.instrument.secondaryInstrumentType^1 iris2.instrument.instrumentType.alltypes^1 iris2.instrument.author^1 iris2.references.allauthors^1 iris2.instrument.researchArea^1 iris2.instrument.typeOfFile^1 iris2.instrument.software^1 iris2.instrument.dataType^1 iris2.instrument.linguisticTarget^1 iris2.instrument.sourceLanguage^1 iris2.instrument.funder^1 iris2.instrument.licence^1 iris2.participants.participantType^1 iris2.participants.firstLanguage^1 iris2.participants.targetLanguage^1 iris2.participants.gender^1 iris2.participants.proficiencyLearner^1 iris2.participants.proficiencyStudentsTaught^1 iris2.participants.yearsOfTeachingExperience^1 iris2.participants.domainOfUse^1 iris2.references.publicationType^1 iris2.references.author^1 iris2.references.author.lastnames^1 iris2.references.booktitle^1 iris2.references.journal^1 iris2.references.publicationDate^1 iris2.references.publicationLatestDate^1 iris2.references.publisher^1 iris2.references.placeOfPublication^1 iris2.references.editor^1 iris2.references.conferenceName^1
</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<str name="hl.fl">text features name</str>
<str name="f.name.hl.fragsize">0</str>
<str name="f.name.hl.alternateField">name</str>
<str name="f.text.hl.fragmenter">regex</str>
</lst>
</requestHandler>
The simple answer is that the Solr DisMax query parser does not support boolean logic in queries. The appearance of it working with the query involving "AND" is probably a side effect of the way your fields are indexed (stop words?) or of what appears in the underlying data.
You can get a better idea what's happening under the hood if you send the debugQuery parameter, e.g.:
/solr/select?q=(marsden AND emma)&qt=yodl_handler&debugQuery=true
There's further documentation on the Solr wiki about the Dismax parser:
The Dismax query parser supports an extremely simplified subset of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses ... but all other Lucene query parser special characters are escaped to simplify the user experience. (see DisMaxQParserPlugin)
The good news is, if you're using a modern version of Solr (3.1+), you have access to the new ExtendedDisMax parser (defType=edismax), which DOES support boolean queries.
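A request that selects the ExtendedDisMax parser could be built as follows. This is only a sketch under assumptions: it passes defType and qf as request parameters instead of going through yodl_handler, it lists just two of the qf fields from the question, and `edismax_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def edismax_url(base, query, qf_fields):
    """Build a select URL that parses `query` with the edismax parser."""
    params = {
        "q": query,
        "defType": "edismax",       # select the ExtendedDisMax parser for this request
        "qf": " ".join(qf_fields),  # fields to search, with optional boosts
        "debugQuery": "true",       # include the parsed query in the response
    }
    return base + "?" + urlencode(params)


url = edismax_url("/solr/select", "marsden OR mackey", ["dc.title^1", "dc3.title^1"])
print(url)
```

With debugQuery=true the response shows how the OR was parsed, which makes it easy to confirm that both terms became optional clauses.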

Solr FieldCollapsing, label and local parameters

I'm using FieldCollapsing to group the results.
Example: I search for *:* and group by names, like:
http://localhost:<port>/solr/select/?q=*:*
&group=true
&group.limit=200
&group.query=Jim
&group.query=Jon
&group.query=Frank Sinatra
It looks like Solr internally runs a separate query for every name.
The point is that I have to set local parameters in order to change the search operator (from OR to AND).
To get valid results I need a query like this:
http://localhost:<port>/solr/select/?q=*:*
&group=true
&group.limit=200
&group.query={!q.op=AND defType=edismax}Jim
&group.query={!q.op=AND defType=edismax}Jon
&group.query={!q.op=AND defType=edismax}Frank Sinatra
This works very well. The problem is that Solr returns the label of each group including the local parameters!
<lst name="grouped">
<lst name="{!q.op=AND defType=edismax}Frank Sinatra"> <---- wrong label
<int name="matches">785</int><result name="doclist" numFound="10" start="0">
<doc>
[...]
A valid result would be:
<lst name="grouped">
<lst name="Frank Sinatra">
<int name="matches">785</int><result name="doclist" numFound="10" start="0">
<doc>
[...]
Is there a way to change the label to the actual term that Solr is searching for?
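If the label cannot be changed on the Solr side, one fallback is to strip the local-parameter prefix on the client after the response arrives. The sketch below is a client-side assumption and not a Solr feature; `group_query` and `plain_label` are hypothetical names, and the pattern only handles a simple {!...} prefix without nested braces.

```python
import re

# Matches a leading local-params block such as "{!q.op=AND defType=edismax}".
LOCAL_PARAMS = re.compile(r"^\{![^}]*\}")


def group_query(term, local_params="q.op=AND defType=edismax"):
    """Build one group.query value with the local parameters prepended."""
    return "{!" + local_params + "}" + term


def plain_label(label):
    """Drop a leading local-params block from a group label, if present."""
    return LOCAL_PARAMS.sub("", label, count=1)


names = ["Jim", "Jon", "Frank Sinatra"]
queries = [group_query(name) for name in names]
# Map each label Solr would echo back to the bare search term.
labels = {query: plain_label(query) for query in queries}
print(labels)
```

Keeping the mapping from the sent group.query to the bare name also avoids any parsing: the client can look up the returned label directly in `labels`.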

Resources