How to count a multi-valued field in Solr

I want to count the values of a multi-valued field in Solr.
I have two multi-valued fields, store_id and filter_id,
and I want to count the values of each field, like this:
store_id = {0,3,7}
count_store_id = 3
filter_id = {12,13,20,22,59,61,62,145}
count_filter_id = 8
Is it also possible for count_store_id to be updated automatically in Solr whenever store_id is updated?
Ashraful Islam - I made the change you suggested, but nothing happens. I have attached an image showing the result.

Yes, as Alexandre Rafalovitch suggested, you can get the value count of a multi-valued field by defining a custom UpdateRequestProcessor chain.
Add the lines below to your solrconfig.xml:
<updateRequestProcessorChain name="multivaluecountnum" default="true">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">store_id</str>
    <str name="dest">store_id_count</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">filter_id</str>
    <str name="dest">filter_id_count</str>
  </processor>
  <processor class="solr.CountFieldValuesUpdateProcessorFactory">
    <str name="fieldName">store_id_count</str>
  </processor>
  <processor class="solr.CountFieldValuesUpdateProcessorFactory">
    <str name="fieldName">filter_id_count</str>
  </processor>
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">store_id_count</str>
    <int name="value">0</int>
  </processor>
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">filter_id_count</str>
    <int name="value">0</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Do not forget to add RunUpdateProcessorFactory at the end of any chain you define in solrconfig.xml; it is the processor that actually writes the document to the index.
Add the store_id_count and filter_id_count fields to the schema file:
<field name="store_id_count" type="int" stored="true"/>
<field name="filter_id_count" type="int" stored="true"/>
Reindex the documents and query; you will see the two new fields, store_id_count and filter_id_count, in the results.
Hope this helps,
Vinod.
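To make the effect of the chain above easier to follow, here is a minimal Python sketch that imitates its three stages on a plain dictionary: clone the source field, replace the clone with its value count, and fall back to 0 when the source field is missing. It is only an illustration of the behaviour, not Solr code; the helper names (clone_field, count_field_values, default_value, run_chain) are hypothetical. Since the chain runs for every document sent to the update handler, the count is recomputed whenever the full document is re-sent.

```python
def clone_field(doc, source, dest):
    """Imitates CloneFieldUpdateProcessorFactory: copy source values into dest."""
    if source in doc:
        doc[dest] = list(doc[source])


def count_field_values(doc, field):
    """Imitates CountFieldValuesUpdateProcessorFactory: replace values with their count."""
    if field in doc:
        doc[field] = len(doc[field])


def default_value(doc, field, value):
    """Imitates DefaultValueUpdateProcessorFactory: set value only if the field is absent."""
    doc.setdefault(field, value)


def run_chain(doc):
    """Apply the stages in the same order as the chain in solrconfig.xml."""
    doc = dict(doc)  # work on a copy so the input stays untouched
    sources = ("store_id", "filter_id")
    for source in sources:
        clone_field(doc, source, source + "_count")
    for source in sources:
        count_field_values(doc, source + "_count")
    for source in sources:
        default_value(doc, source + "_count", 0)
    return doc


indexed = run_chain({
    "id": "1",
    "store_id": [0, 3, 7],
    "filter_id": [12, 13, 20, 22, 59, 61, 62, 145],
})
print(indexed["store_id_count"], indexed["filter_id_count"])  # 3 8
```

A document without store_id or filter_id ends up with a count of 0, which is what the DefaultValueUpdateProcessorFactory entries are for.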

You can do this with a custom UpdateRequestProcessor chain that uses CountFieldValuesUpdateProcessorFactory.

Related

count values in multivalue field

I need to count how many values a multi-valued field has so that I can sort results by that count.
solrconfig.xml
<updateRequestProcessorChain name="add-numbers-count">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">cat_ids</str>
    <str name="dest">cat_ids_count</str>
  </processor>
  <processor class="solr.CountFieldValuesUpdateProcessorFactory">
    <str name="fieldName">cat_ids_count</str>
  </processor>
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">cat_ids_count</str>
    <int name="value">0</int>
  </processor>
  <!-- <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" /> -->
</updateRequestProcessorChain>
<initParams path="/update/**,/query,/select,/spell">
  <lst name="defaults">
    <str name="df">_text_</str>
    <str name="update.chain">add-numbers-count</str>
  </lst>
</initParams>
managed-schema
<field name="cat_ids" type="plongs"/>
<field name="cat_ids_count" type="pint"/>
Note that RunUpdateProcessorFactory and LogUpdateProcessorFactory are commented out.
If I enable them, the update fails with a nonsensical error:
ERROR: [doc=44996] Error adding field 'data_readout'='2025-06-01' msg=Invalid Date String:'2025-06-01'int(44996)
Solr is not creating the cat_ids_count field, I guess because there is no RunUpdateProcessorFactory.
Do I have to delete and recreate the collection? Or is there an error I can't see?
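A note on the error quoted above: the message names the date field data_readout, not cat_ids_count. Solr date fields expect the full ISO-8601 UTC form (for example 2025-06-01T00:00:00Z), so a bare 2025-06-01 is rejected once RunUpdateProcessorFactory actually tries to index the document; with that processor commented out, the chain ends before anything is written. The sketch below assumes data_readout is a date field and that the value can be normalized on the client before it is sent. to_solr_date is a hypothetical helper, not part of Solr.

```python
from datetime import datetime, timezone


def to_solr_date(value: str) -> str:
    """Convert a 'YYYY-MM-DD' string to midnight UTC in Solr's date syntax."""
    parsed = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_solr_date("2025-06-01"))  # 2025-06-01T00:00:00Z
```

If the date comes from another system with a different format or time zone, the format string and the UTC assumption would need to be adjusted accordingly.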

SolrCloud Deduplication Overwrite isn't working

I've been struggling to get Deduplication to work in SolrCloud (version 8.6). My solrconfig.xml contains:
<updateRequestProcessorChain name="dedupeOn">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">dedupeId</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">journal_doi,internal_pmid</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
and
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
  <lst name="defaults">
    <str name="update.chain">dedupeOn</str>
  </lst>
</requestHandler>
My managed-schema contains:
<field name="dedupeId" type="string" indexed="true" stored="true" multiValued="false" />
In my test, I add 1000 documents, and commit manually. I see the "dedupeId" is created with the hash.
I then add 10 more documents that I know are duplicates, and again commit manually. These 10 rows are added, and the original document with the matching dedupeId is not overwritten. For example:
"response":{"numFound":2,"start":0,"maxScore":2.1554677,"numFoundExact":true,"docs":[
{
"internal_pmid":"13367837",
"dedupeId":"7f0306ecd909a68e",
"journal_doi":"10.1097/00005053-195603000-00006"},
{
"internal_pmid":"13367837",
"dedupeId":"7f0306ecd909a68e",
"journal_doi":"10.1097/00005053-195603000-00006"}]
}}
I'm not sure if it's significant, but in the Solr logs I see some "add" entries that contain, in part:
webapp=/solr path=/update params={update.distrib=TOLEADER&update.chain=dedupeOn&distrib.from=*(shard path)*/&wt=javabin&version=2}{add=[00001hLxMb (1690871781072568320)]} 0 2
but other "add" entries do not contain the update.chain parameter, e.g.:
webapp=/solr path=/update params={wt=javabin&version=2}{add=[00000sta0n (1690871780667817984)]} 0 2
Any help would be greatly appreciated.

Solr deduplication error while indexing nutch data

I integrated Nutch 2.3.1 with Solr 6.5, and with this setup I could push data to Solr and have it indexed. Now I want to remove duplicate documents, so I made the following modifications to schema.xml and solrconfig.xml:
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">id,content,date,url</str> <!-- changing to id <str name="fields">name,features,cat</str>-->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
But after indexing with bin/nutch solrindex http://localhost:8983/solr/testcore -all
I get an error.
Please help me sort out this issue.
Thank you in advance :)
This issue might be related to the schema update. If data already existed in the Solr core when you changed the schema, Nutch will treat it as a schema mismatch. The best way to fix this is to re-crawl the pages with the updated schema. Keep in mind that any schema update can cause problems with your existing index.
This post is old, but I am answering for future reference, for people who run into the same issue.
Best :)

Is there a multiValue field sort workaround in solr

I am looking for alternative ways to sort on a multi-valued field.
I know that this question has been asked before and that the solutions talk about min and max, but that is not the strategy I am looking for.
Is there a way we can do a COPY of the multivalue over to another field which can be used for sorting?
For example like this:
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<copyField source="cat" dest="firstcat"/>
<field name="firstcat" type="string" indexed="true" stored="false" multiValued="false"/>
Answering my own question.
The copyField above will not work: it throws an exception when the multi-valued field contains more than one value. I mean, duh. Obviously.
One working solution is to define an updateRequestProcessorChain in solrconfig.xml and attach it to the update handler.
Here is a sample:
<updateRequestProcessorChain name="concatFields">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">str1</str>
    <str name="dest">str2</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">str2</str>
    <str name="delimiter">_</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">str2</str>
    <str name="dest">str3</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Then attach the chain to the update path:
<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">concatFields</str>
  </lst>
</initParams>
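As a rough illustration of what the concatFields chain above does to a document, the Python sketch below imitates its three steps on a dictionary: clone str1 into str2, join the values of str2 with the delimiter, then clone the joined value into str3. The helper names are hypothetical and this is not Solr code; it only shows why str2 and str3 end up single-valued and therefore sortable.

```python
def clone_field(doc, source, dest):
    """Imitates CloneFieldUpdateProcessorFactory: copy the source value(s) into dest."""
    if source in doc:
        value = doc[source]
        doc[dest] = list(value) if isinstance(value, list) else value


def concat_field(doc, field, delimiter):
    """Imitates ConcatFieldUpdateProcessorFactory: join multiple values into one string."""
    if isinstance(doc.get(field), list):
        doc[field] = delimiter.join(str(v) for v in doc[field])


def run_concat_chain(doc):
    """Apply the steps in the same order as the concatFields chain."""
    doc = dict(doc)  # work on a copy so the input stays untouched
    clone_field(doc, "str1", "str2")
    concat_field(doc, "str2", "_")
    clone_field(doc, "str2", "str3")
    return doc


result = run_concat_chain({"id": "1", "str1": ["red", "green", "blue"]})
print(result["str2"])  # red_green_blue
print(result["str3"])  # red_green_blue
```

The original multi-valued str1 is left as it was, so it can still be used for filtering while the concatenated copy is used for sorting.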

Basic UIMA with SOLR

I am trying to connect UIMA with Solr. I have downloaded the Solr 3.5 distribution and have it running successfully with Nutch and Tika on Windows 7, using Solr Cell and curl via Cygwin.
To begin, I copied the 6 jars from solr/contrib/uima/lib to the working /lib in solr.
Next, I read the readme.txt file in solr/contrib/uima/lib and edited both my solrconfig.xml and schema.xml to no avail.
I then found this link, which seemed a bit more applicable since I didn't care to use Alchemy or OpenCalais: http://code.google.com/a/apache-extras.org/p/rondhuit-uima/?redir=1
Still, when I run a curl command that imports a PDF via Solr Cell, I do not get the additional UIMA fields, nor do I get anything in my logs. The test.pdf is parsed, though, and I see the PDF in Solr using:
curl 'http://localhost:8080/solr/update/extract?fmap.content=content&literal.id=doc1&commit=true' -F "file=@test.pdf"
SolrConfig.XML
<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
        <str name="host">http://localhost</str>
        <str name="port">8080</str>
      </lst>
      <str name="analysisEngine">C:\uima\desc\com\rondhuit\uima\desc\NextAnnotatorDescriptor.xml</str>
      <bool name="ignoreErrors">true</bool>
      <str name="logField">id</str>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">com.rondhuit.uima.next.NamedEntity</str>
          <lst name="mapping">
            <str name="feature">entity</str>
            <str name="fieldNameFeature">uname</str>
            <str name="dynamicField">*_sm</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update/uima" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uima</str>
  </lst>
</requestHandler>
I also adjusted my requestHandler:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">uima</str>
  </lst>
</requestHandler>
Schema.XML
<!-- fields for UIMA -->
<field name="uname" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
<dynamicField name="*_sm" type="string" indexed="true" stored="true"/>
All I am trying to do is have UIMA pull out names from text (just as a starting demo), and I cannot figure out what I am doing wrong.
Thank you in advance for reading this.
Not sure if this ever got addressed, but in case someone else is looking: I had this same problem yesterday. I figured out that I was calling /update/extract to use Solr Cell, and that handler doesn't run UIMA because the UIMA chain is attached to /update.

Resources