Batch analysing documents with solr (extracting tf idf information) - solr

Hi i want to extract the tf-idf values for terms in documents. After a bit of searching i found a request handler in the example configuration that can do that: http://localhost:8983/solr/tvrh/?q=id:documentid&qt=tvrh&tv=true&tv.all=true
What i want to do is to batch-analyse documents. This is what i do:
sending a new document to the solr update handler with commit=true
Querying solr for the term vectors using the above url
The problem is that inserting a docment with commit=true takes about 600ms which is not really acceptable for my usecase.
i then found http://wiki.apache.org/solr/RealTimeGet and tried to combine that with the termvector request handler:
<requestHandler name="/tvrh" class="solr.RealTimeGetHandler" startup="lazy">
<lst name="defaults">
<str name="df">text</str>
<bool name="tv">true</bool>
</lst>
<arr name="last-components">
<str>tvComponent</str>
</arr>
</requestHandler>
But then i get this as response when i try to query the handler: http://pastebin.com/KtB7DBSv I suppose combining those two is not possible?
How can i improve the performance anyway? Any suggestions? Is there another approach to get the tf idf values?

i did not found a solution to the specific problem in the question, but found that using softCommit=true is much more faster.

Related

Solr More Like This Says "numFound" doesn't equal number of docs in match

I have a Solr More Like This Handler, configured as follows:
Request Handler Configuration
<requestHandler name="/themlturl" class="solr.MoreLikeThisHandler">
<lst name="defaults">
<str name="wt">json</str>
<int name="rows">5</int>
<str name="mlt.fl">name, category_stack</str>
<str name="mlt.qf">name^3 category_stack^5</str>
<str name="fl">id, name</str>
<str name="mlt">true</str>
<str name="mlt.mintf">1</str>
</lst>
</requestHandler>
Simple Query
Queries that has one document match work fine
results in
Query With More Than One Document
I am trying to get documents similar to more than one document using an OR in the q field.
This results in the following response
it is clear that Solr found the three documents since the match > numFound is 3, but the returned documents in the match > docs is only one, and the results in the response are documents similar to that one document.
Does the MLT handler support multiple documents ? if not, is there a solution other than querying the handler once for each document.
What I am trying to build is a simple content-based recommendation engine which is supposed to show documents similar to the ones a user saves.

Default operator AND using SOLR on Coldfusion

I just want the default operator to be AND and not an OR for every basic search. For a particular collection, in the schema.xml and solrconfig.xml files I set the defaultOperator to AND (makes no difference) and set the mm to 100%, restart the CF Add-on Server services and still no difference when doing a search. I am on Coldfusion 2018.
<cfsearch
name='qHearings'
collection='hearings_collection'
criteria='conflicts of interest'
/>
returns me documents with words 'conflicts' OR 'interest'. If I change it to:
<cfsearch
name='qHearings'
collection='hearings_collection'
criteria='conflicts AND of AND interest'
/>
returns me documents with words 'conflicts' AND 'interest'. This is good but my users don't like be told to use AND and I hear endless comments about why can't it be like google search :(
I have been reading up on SOLR and it seems like many have the same problem but I try the suggestions but I always get an OR search result.
Anyone got basic SOLR search to default to AND?
Thank you #MatsLindh, your comments lead me to the right path! I was setting
<solrQueryParser q.op="AND"/>
in the schema.xml thinking that was where I was suppose to do it (of course, it made no difference I still got an OR search result).
I couldn't find a Solr log for Coldfusion but I played around with solrconfig.xml file for one particular collection. After re-reading your comments I added
<str name="q.op">AND</str>
to the "standard" handler and it worked! I am somewhat embarrassed because it wasn't obvious to me to do it that way and for all my googling I didn't see examples of it being done that way (I only saw it as being passed in a query parameter).
So my standard handler looks like this:
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<!-- default values for query parameters -->
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="hl.fl">summary title </str>
<str name="df">contents</str>
<str name="q.op">AND</str>
<str name="mm">100%</str>
<!-- omp = Only More Popular -->
<str name="spellcheck.onlyMorePopular">false</str>
<!-- exr = Extended Results -->
<str name="spellcheck.extendedResults">false</str>
<!-- The number of suggestions to return -->
<str name="spellcheck.count">1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
Super embarrassing for me that the solution was so simple.

Search suggestions in django-oscar using solr

I've setup a django-oscar project and enabled solr 4.7.2 on it as per documentation.
Solr seems to be working fine. Testing the suggestions for 'exxample' (localhost:8983/solr/collection1/spell?spellcheck.q=exxample&spellcheck=true>) I get:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">10</int>
</lst>
<result name="response" numFound="0" start="0"/>
<lst name="spellcheck">
<lst name="suggestions">
<lst name="exxampl">
<int name="numFound">1</int>
<int name="startOffset">0</int>
<int name="endOffset">8</int>
<int name="origFreq">0</int>
<arr name="suggestion">
<lst>
<str name="word">exampl</str>
<int name="freq">2</int>
</lst>
</arr>
</lst>
<bool name="correctlySpelled">false</bool>
<lst name="collation">
<str name="collationQuery">exampl</str>
<int name="hits">2</int>
<lst name="misspellingsAndCorrections">
<str name="exxampl">exampl</str>
</lst>
</lst>
</lst>
</lst>
</response>
I've also enabled OSCAR_SEARCH_FACETS to make sure that Solr has been correctly registered by Django-Oscar, and it seems to be working fine.
HOWEVER, when I do a test search for a simple misspelling in django-oscar, I get 0 returned search results and no suggestions. I'm not sure what to do next.
Help would be greatly appreciated!
I've managed to fix this problem. I'll write my complete solution to setting up Solr with spelling suggestions on Django-Oscar since setup procedures require adjustments from that described in the official documentation. This is also my first time working with Solr (or any search engine), so don't expect some expert guidance, just a guide on how to get Solr up and running on Oscar.
I am using Oscar 1.5 with Solr 4.7.2 (solutions also works for 4.10.4 ... not sure about other versions). Do everything as per documentations - note that there is a slight difference in instructions for versions of Oscar that are < 1.5.
Once you have Solr installed and running you can test out an inquiry on the Solr server # localhost:8983/solr/collection1/spell?spellcheck.q=[your search inquiry goes here; no brackets]&spellcheck=true>. Needs to be a word from your database - either in product description or product title.
You will get an error result saying that Analyzer needs to be of same type. Fix this by editing the solrconfig.xml file located at ./solr-4.7.2/example/solr/collection1/conf/solrconfig.xml. Search for <str name="field">, and change each non-commented instance to <str name="field">text</str> - you can also change each instance to <str name="field">title</str>, but this restricts to words found in titles only. Restart the Solr server. These changes will do away with the Analyzer error and your Solr server will now start showing results, however they won't yet be fed into your Oscar site.
To fix this you need to make another adjustment to the same solrconfig.xml file. Search for <requestHandler name="/select" class="solr.SearchHandler">, and at the bottom of this request handler include the following code:
<arr name="last-components">
<str>spellcheck</str>
</arr>
Restart the server. Now you have spelling suggestions in your Oscar site. Hope others have found this helpful. Like I said - this is the first time I'm using Solr. If someone has anything to add, or extend Solr functionality on Oscar it would be great.

Solr very slow filters

I have problem with very slow filters in Solr (version 4.9.1), there is ~50k documents. For first query which use specific category_id filter value, query takes ~15 seconds, second time is much more faster (it takes miliseconds). But i want to have fast filters always :) So after googling it i read that I must have filterCache and cache Autowarming
Sooo what I've done:
filterCache:
<filterCache
class="solr.FastLRUCache"
size="16384"
initialSize="4096"
autowarmCount="4096" />
firstSearcher:
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">*</str>
<str name="fq">category_id:1043</str>
</lst>
</arr>
</listener>
<useColdSearcher>true</useColdSearcher>
<useFilterForSortedQuery>true</useFilterForSortedQuery>
<maxWarmingSearchers>2</maxWarmingSearchers>
It doesn't work ;/ no idea why... For first entry on this category it takes 15s, than its fast. But I always must have fast response, for categories and for other filters.
I make an experiment, everything works better if I use mainquery instead of filters, but filters should be as fast as mainquery (i read it somewhere).
Summary:
What i'm doing wrong that autowarming dont work?
How make autowarming for each filter/each filter value?
What I'm trying to do:
Ok so, I have shop with ~50 000 products and ~1000 categories and a lot of other filters (type, price etc), my catalog is based on SOLR (filtering), now if I use filters first entry to category takes 15seconds, it must be fast every single time....
My example query:
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="debugQuery">true</str>
<str name="website_id:1"/>
<str name="stats.field">PLN_0_price_decimal</str>
<str name="product_status:1"/>
<str name="q">**</str>
<str name="store_id:1"/>
<str name="fq">category_id:10561</str>
</lst>
</lst>
So, solution was simple, I have to use * instead of ** in my query.
Part of debug section from response with *:
<str name="parsedquery">MatchAllDocsQuery(*:*)</str>
<str name="parsedquery_toString">*:*</str>
Same part of debug section from response with **:
<str name="parsedquery">textSearch:**</str>
<str name="parsedquery_toString">textSearch:**</str>
The first time you use a filter, every document needs to be looked at, even if the main query will match only a couple. You could disable caching for such filter or switch to a post-filter (by assigning filter cost). The fuller explanation is here.

Unable to retrieve computed distance in Spatial Solr queries

I'm using this plugin to allow spatial queries in Solr. I have followed the steps included in the documentation and I've got the spatial queries working fine.
Now I want to retrieve the computed distance. I added these lines in the solrconfig.xml file:
<searchComponent name="geodistance" class="nl.jteam.search.solrext.spatial.GeoDistanceComponent">
<defaults>
<str name="distanceField">geo_distance</str>
</defaults>
</searchComponent>
And I have added the "geodistance" component to the standard request handler:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
<arr name="components">
<str>query</str>
<str>geodistance</str>
</arr>
</requestHandler>
Then, when I run a query such as "q={!spatial lat=41.641184 long=-0.894032 radius=2 calc=arc unit=km} cafeteria" it works, but only the first time. When I run the same query again I get this error:
GRAVE: java.lang.NullPointerException
at nl.jteam.search.solrext.spatial.DistanceFieldValueSource.getValues(DistanceFieldValueSource.java:57)
at nl.jteam.search.solrext.spatial.GeoDistanceComponent.process(GeoDistanceComponent.java:60)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
I have no idea where is the error because the first time the query works and I get the computed distance in the "geo_distance" field. But when repeating the query, I get a NullPointerException.
This problem is fixed in version 1.0-RC5. This version has been released some days ago.

Resources