Apache Solr search by "AND" - solr

I am working with Apache Solr.
Currently, it is working fine. When I type pork AND belly, it returns all results containing both pork and belly.
But I need to be able to search for pork and belly (lowercase "and") and get the same result.
Right now it does not work, because it returns all results containing pork, or and, or belly.
The easiest way would be to change the query in JavaScript before sending it.
But is there a way to do it in Apache Solr by updating the config?
Thanks.
What I did: I tried to replace it in schema.xml by adding a PatternReplaceCharFilterFactory to the dynamic field, but that failed.
Any suggestions?

The eDisMax query parser can treat lowercase "and" and "or" as operators. This was the default in older Solr releases; newer releases may need it enabled explicitly. In your solrconfig.xml, specify that parser and explicitly tell it to accept lowercase operators:
<requestHandler name="search" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<bool name="lowercaseOperators">true</bool>
</lst>
...
</requestHandler>

If you're using the (e)dismax query parser, just searching for pork belly with q.op=AND should work fine. As long as you have a stop word filter configured for your field (with a proper stop word list), "and" will automatically be removed. The default stopwords_en.txt file bundled with Solr has it in its list.
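As a rough illustration of the q.op=AND suggestion above, here is a minimal Python sketch that builds such a request URL. The host, port, and core name (`mycore`) are assumptions for illustration only; `defType` and `q.op` are the parameters named in the answers.

```python
from urllib.parse import urlencode

# Hypothetical base URL: replace host, port, and core name with your own.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def build_and_query(user_input: str) -> str:
    """Return a select URL that requires every term the user typed."""
    params = {
        "q": user_input,       # raw user text, e.g. "pork belly"
        "defType": "edismax",  # use the eDisMax query parser
        "q.op": "AND",         # every term must match
    }
    return SOLR_SELECT + "?" + urlencode(params)


print(build_and_query("pork belly"))
```

With this approach the front end can send the user's text unchanged, and the operator choice stays in one place.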

Related

Default operator AND using SOLR on Coldfusion

I just want the default operator to be AND and not OR for every basic search. For a particular collection, I set the defaultOperator to AND in the schema.xml and solrconfig.xml files (it makes no difference) and set mm to 100%, then restarted the CF Add-on Server services, and there is still no difference when doing a search. I am on ColdFusion 2018.
<cfsearch
name='qHearings'
collection='hearings_collection'
criteria='conflicts of interest'
/>
returns documents containing the words 'conflicts' OR 'interest'. If I change it to:
<cfsearch
name='qHearings'
collection='hearings_collection'
criteria='conflicts AND of AND interest'
/>
it returns documents containing the words 'conflicts' AND 'interest'. This is good, but my users don't like being told to use AND, and I hear endless comments about why it can't be like Google search :(
I have been reading up on Solr and it seems many people have the same problem. I have tried the suggestions, but I always get an OR search result.
Has anyone got basic Solr search to default to AND?
Thank you @MatsLindh, your comments led me down the right path! I was setting
<solrQueryParser q.op="AND"/>
in the schema.xml, thinking that was where I was supposed to do it (of course, it made no difference and I still got an OR search result).
I couldn't find a Solr log for ColdFusion, but I played around with the solrconfig.xml file for one particular collection. After re-reading your comments I added
<str name="q.op">AND</str>
to the "standard" handler and it worked! I am somewhat embarrassed because it wasn't obvious to me to do it that way, and in all my googling I didn't see examples of it being done that way (I only saw it being passed as a query parameter).
So my standard handler looks like this:
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<!-- default values for query parameters -->
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="hl.fl">summary title </str>
<str name="df">contents</str>
<str name="q.op">AND</str>
<str name="mm">100%</str>
<!-- omp = Only More Popular -->
<str name="spellcheck.onlyMorePopular">false</str>
<!-- exr = Extended Results -->
<str name="spellcheck.extendedResults">false</str>
<!-- The number of suggestions to return -->
<str name="spellcheck.count">1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
Super embarrassing for me that the solution was so simple.
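To make the difference between the two default operators concrete, here is a toy Python matcher. It is only a sketch of the semantics, not how Solr evaluates queries: it splits on whitespace, ignores stop words and analysis, and the sample documents are made up.

```python
def matches(doc_text: str, query: str, default_op: str = "OR") -> bool:
    """Toy matcher: whitespace tokens, case-insensitive, no stop words."""
    doc_terms = set(doc_text.lower().split())
    query_terms = query.lower().split()
    hits = [term in doc_terms for term in query_terms]
    # AND requires every query term; OR requires at least one.
    return all(hits) if default_op == "AND" else any(hits)


# Made-up documents for illustration.
docs = [
    "hearing on conflicts of interest",
    "interest rates hearing",
    "armed conflicts overview",
]

or_results = [d for d in docs if matches(d, "conflicts interest", "OR")]
and_results = [d for d in docs if matches(d, "conflicts interest", "AND")]
print(len(or_results), len(and_results))  # prints "3 1"
```

With OR, every document containing either word comes back; with AND, only the one containing both does, which is the behaviour the q.op default in the handler switches on.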

How to tune apache SOLR spellcheck for desired suggestion?

Environment: SAP Hybris 6.7.0.0, Apache Solr 7.7.2
I am using Solr to power an indie eCommerce platform. In that context we have product data in the Solr index, for example: productName_text, BrandName_string, etc.
I've created a spellcheck component with the configuration below:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">en</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<str name="field">spellcheck_en</str>
<str name="distanceMeasure">internal</str>
<float name="accuracy">0.7</float>
<int name="maxEdits">2</int>
<int name="minPrefix">0</int>
<int name="maxInspections">5</int>
<int name="minQueryLength">2</int>
</lst>
</searchComponent>
And I turned on spellcheck on the /select request handler:
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.collate">true</str>
Spellcheck is configured dynamically for a single field. Suppose it is:
productName_text
which consists of product names for typical electronic gadgets or their cases. For example:
"Apple Watch Series 2 38mm Stainless Steel Case with Midnight Blue Modern Buckle Medium"
"A.O. Smith X4 RO Water Purifier (White)"
If we misspell "watch" as "wath", we get the suggestion "water". If we misspell "soup maker" as "suop maker", we get "shop maker". How can I tune the spellchecker for my data? Is there any other solution for misbehaving queries?
I tried all the spellcheck settings described at https://cwiki.apache.org/confluence/display/SOLR/SpellCheckComponent but couldn't find a solid solution.
I tried implementing WordBreakSolrSpellChecker, which doesn't seem to change the outcome.
I played around with "spellcheck.collate" and other attributes, but it returns suggestions that have no search results.
I've observed that spellcheck seems to be strongly affected by multivalued fields(?)
In general, how should I handle terms that give wrong suggestions, or suggestions that must not appear based on user preferences? Is it possible to use two different spellcheck components, so that if "DirectSolrSpellChecker" doesn't give the desired suggestion, I can switch to "FileBasedSpellChecker"? Can I maintain a .txt file to track all the terms that need tuning, or do the same in SAP Hybris?
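One thing that may be worth trying for the collations with no search results is `spellcheck.maxCollationTries`, which asks Solr to test candidate collations against the index before returning them. The Python sketch below only builds such a request; the base URL and core name are assumptions, and the dictionary name `en` is taken from the `name` of the spellchecker in the component configuration above.

```python
from urllib.parse import urlencode

# Hypothetical base URL: replace host, port, and core name with your own.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def spellcheck_url(user_query: str) -> str:
    """Build a select URL that asks for verified spellcheck collations."""
    params = [
        ("q", user_query),
        ("spellcheck", "true"),
        ("spellcheck.dictionary", "en"),        # spellchecker name from the config
        ("spellcheck.count", "5"),
        ("spellcheck.collate", "true"),
        ("spellcheck.maxCollationTries", "5"),  # test collations against the index
    ]
    return SOLR_SELECT + "?" + urlencode(params)


print(spellcheck_url("suop maker"))
```

Whether this improves suggestions such as "water" for "wath" depends on the data; it only filters out collations that would return nothing.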

Apache nutch not indexing all documents to apache solr

I am using Apache Nutch 2.3 (latest version). I have crawled about 49000 documents with Nutch. According to a MIME type analysis of the documents, the crawled data contains about 45000 text/html documents. But when I look at the indexed documents in Solr (4.10.3), only about 14000 documents are indexed. Why is there such a huge difference (45000 - 14000 = 31000)? If I assume that Nutch only indexes text/html documents, then at least 45000 documents should be indexed.
What is the problem, and how can I solve it?
In my case this problem was due to missing Solr indexer information in nutch-site.xml. When I updated the config, the problem was resolved. Please check your crawler log at the indexing step. In my case it reported that no Solr indexer plugin was found.
The following property was added to nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>plugin details here </description>
</property>
You should look at your Solr logs to see if there's anything about "duplicate" documents, or just look in the solrconfig.xml file for the core into which you are pushing the documents. It is likely that a "dedupe" call is being made on the update handler, and the fields it uses may be causing documents that are duplicates on those few fields to be dropped. You'll see something like this:
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<!-- change dedupe to uuid, or comment out the update.chain line -->
<str name="update.chain">dedupe</str>
<str name="config">dih-config.xml</str>
</lst>
</requestHandler>
and later in the file the definition of the dedupe update.chain,
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">true</bool>
<!-- these fields decide which records count as duplicates -->
<str name="fields">url,date,rawline</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
The "fields" element selects which input data is used to determine the uniqueness of a record. Of course, if you know there's no duplication in your input data, this is not the issue. But the above configuration will throw out any records that are duplicates on the fields shown.
You may not be using the dataimport requestHandler, but rather the "update" requestHandler. I'm not sure which one Nutch uses. Either way, you can simply comment out the update.chain, change it to a different processor chain such as "uuid", or add more fields to the "fields" declaration.
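The effect of a signature-based dedupe chain can be pictured with a small Python sketch. This is an illustration only, not Solr's implementation: MD5 stands in for Lookup3Signature, the index is a plain dictionary, and the sample documents are made up.

```python
import hashlib

# Field names taken from the "fields" setting in the config above.
SIGNATURE_FIELDS = ["url", "date", "rawline"]


def signature(doc: dict) -> str:
    """Hash the chosen fields; MD5 stands in for Solr's Lookup3Signature."""
    joined = "\x00".join(str(doc.get(field, "")) for field in SIGNATURE_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def index_with_dedupe(docs: list) -> dict:
    """Key each document by its signature; later duplicates overwrite earlier ones."""
    index = {}
    for doc in docs:
        index[signature(doc)] = doc
    return index


# Made-up documents: the first two agree on all signature fields.
docs = [
    {"url": "http://example.com/a", "date": "2015-01-01", "rawline": "x", "title": "first"},
    {"url": "http://example.com/a", "date": "2015-01-01", "rawline": "x", "title": "second"},
    {"url": "http://example.com/b", "date": "2015-01-01", "rawline": "x", "title": "third"},
]

index = index_with_dedupe(docs)
print(len(index))  # prints 2
```

Three documents go in but only two remain, because the first two share a signature. A large gap between crawled and indexed counts could arise the same way if many documents agree on the signature fields.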

Batch analysing documents with solr (extracting tf idf information)

Hi, I want to extract the tf-idf values for terms in documents. After a bit of searching I found a request handler in the example configuration that can do that: http://localhost:8983/solr/tvrh/?q=id:documentid&qt=tvrh&tv=true&tv.all=true
What I want to do is batch-analyse documents. This is what I do:
1. Send a new document to the Solr update handler with commit=true
2. Query Solr for the term vectors using the above URL
The problem is that inserting a document with commit=true takes about 600 ms, which is not really acceptable for my use case.
I then found http://wiki.apache.org/solr/RealTimeGet and tried to combine that with the term vector request handler:
<requestHandler name="/tvrh" class="solr.RealTimeGetHandler" startup="lazy">
<lst name="defaults">
<str name="df">text</str>
<bool name="tv">true</bool>
</lst>
<arr name="last-components">
<str>tvComponent</str>
</arr>
</requestHandler>
But then I get this response when I try to query the handler: http://pastebin.com/KtB7DBSv. I suppose combining those two is not possible?
How can I improve the performance anyway? Any suggestions? Is there another approach to get the tf-idf values?
I did not find a solution to the specific problem in the question, but found that using softCommit=true is much faster.
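A minimal Python sketch of the two-step flow with a soft commit might look like this. It only builds the URLs; the base address and the term vector parameters come from the question, while the `/update` path without a core name is an assumption that may differ in your setup.

```python
from urllib.parse import urlencode

# Host and port taken from the URL in the question.
SOLR_BASE = "http://localhost:8983/solr"


def update_url(soft: bool = True) -> str:
    """URL for posting a document, using a soft commit by default."""
    commit_param = "softCommit" if soft else "commit"
    return f"{SOLR_BASE}/update?" + urlencode({commit_param: "true"})


def term_vector_url(doc_id: str) -> str:
    """URL for fetching the term vectors of one document."""
    params = {"q": f"id:{doc_id}", "qt": "tvrh", "tv": "true", "tv.all": "true"}
    return f"{SOLR_BASE}/tvrh/?" + urlencode(params)


print(update_url())
print(term_vector_url("documentid"))
```

The document would be posted to the first URL and its term vectors read from the second; only the commit parameter changes compared with the original flow.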

Problems with solr queries

When I run a search against Solr on my local machine, I get a query like this:
http://localhost:8080/solr/project/select/?q=concept&version=2.2&start=0&rows=10&indent=on
But instead, I would like to get a complete query with all the active settings, filters, tokenizer, etc.
For instance, something like this:
http://localhost:8983/solr/select/?q=macrosoft&qt=spellchecker&cmd=rebuild
How can I set up this configuration? I have tried a lot of things with no result. I want to know exactly how the spellchecker is working.
Thanks in advance
Change in solrconfig.xml:
<str name="echoParams">all</str>
<int name="rows">10</int>
<str name="fl">*</str>
<str name="version">2.1</str>
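The same `echoParams` setting can also be passed per request instead of in solrconfig.xml. Here is a short Python sketch that builds such a URL, reusing the address and parameters from the question; with `echoParams=all`, the response header lists the request parameters together with the handler defaults that were applied.

```python
from urllib.parse import urlencode

# Base URL taken from the question.
BASE = "http://localhost:8080/solr/project/select/"


def echo_all_url(query: str) -> str:
    """Build a select URL that asks Solr to echo all effective parameters."""
    params = {
        "q": query,
        "start": 0,
        "rows": 10,
        "indent": "on",
        "echoParams": "all",  # echo request parameters and handler defaults
    }
    return BASE + "?" + urlencode(params)


print(echo_all_url("concept"))
```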
