Solr Deduplication (dedupe) giving all zeros in signatureField - solr

I've followed the examples listed in the documentation here: http://wiki.apache.org/solr/Deduplication and https://cwiki.apache.org/confluence/display/solr/De-Duplication
However, when analyzing the results every signatureField gets returned like so:
0000000000000000
I can't seem to figure out why a unique signature isn't being generated.
Relevant config sections:
solrconfig.xml
<requestHandler name="/update"
class="solr.XmlUpdateRequestHandler">
<!-- See below for information on defining
updateRequestProcessorChains that can be used by name
on each Update Request
-->
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
...
<!-- Deduplication
An example dedup update processor that creates the "id" field
on the fly based on the hash code of some other fields. This
example has overwriteDupes set to false since we are using the
id field as the signatureField and Solr will maintain
uniqueness based on that anyway.
-->
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">signatureField</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
schema.xml
<fields>
<!-- Valid attributes for fields:
name: mandatory - the name for the field
type: mandatory - the name of a previously defined type from the
<types> section
indexed: true if this field should be indexed (searchable or sortable)
stored: true if this field should be retrievable
multiValued: true if this field may contain multiple values per document
omitNorms: (expert) set to true to omit the norms associated with
this field (this disables length normalization and index-time
boosting for the field, and saves some memory). Only full-text
fields or fields that need an index-time boost need norms.
Norms are omitted for primitive (non-analyzed) types by default.
termVectors: [false] set to true to store the term vector for a
given field.
When using MoreLikeThis, fields used for similarity should be
stored for best performance.
termPositions: Store position information with the term vector.
This will increase storage costs.
termOffsets: Store offset information with the term vector. This
will increase storage costs.
default: a value that should be used if no value is specified
when adding a document.
-->
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
... etc
I'm wondering if anyone can steer me in the right direction?

Related

Solr - ignore predefined words

My Search application leverages Solr in order to search on some wikis and forums content.
Sometimes vulgar words appear in posts and consequently they are indexed in Solr and appear in suggestions and searches as well.
Is there a way for Solr to ignore a set of predefined words considered vulgar?
The use case would be the following. We have:
A) a schema like:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="body" type="string" indexed="true" stored="true" />
B) a text file containing the vulgar words to ignore: words_to_ignore.txt. For instance it would contain:
badword1 badword2
C) a wiki with the title "my wiki badword1".
If we ran the query:
http://localhost:8983/my_wiki_collection/select?q=name:(wiki+AND+badword1)
We would expect Solr to return the document:
<doc>
<str name="id">abcd-acdf-a1ga</str>
<str name="name">my wiky</str>
<str name="body">This is my amazing wiki</str>
</doc>
Just add them to your stopwords list.
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
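A minimal sketch of what that could look like in schema.xml. Stop filters only run on analyzed fields, so this assumes title and body are switched from string to a text type. The type name text_no_vulgar and the tokenizer/lowercase chain are illustrative assumptions; words_to_ignore.txt is the file from the question, placed in the core's conf directory with one word per line.

```xml
<!-- Hypothetical field type; the name and the analyzer chain are assumptions -->
<fieldType name="text_no_vulgar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Removes every token listed in words_to_ignore.txt at index and query time -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="words_to_ignore.txt"/>
  </analyzer>
</fieldType>

<field name="title" type="text_no_vulgar" indexed="true" stored="true"/>
<field name="body" type="text_no_vulgar" indexed="true" stored="true"/>
```

Note that this only affects the indexed terms. The stored value still contains the original text, so if the words must not be displayed either, they have to be stripped before indexing. A reindex is needed after changing the field type.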

Query multiple collections with different fields in solr

Given the following (single core) queries:
http://localhost/solr/a/select?indent=true&q=*:*&rows=100&start=0&wt=json
http://localhost/solr/b/select?indent=true&q=*:*&rows=100&start=0&wt=json
The first query returns "numFound":40000.
The second query returns "numFound":10000.
I tried putting these together by:
http://localhost/solr/a/select?indent=true&shards=localhost/solr/a,localhost/solr/b&q=*:*&rows=100&start=0&wt=json
Now I get "numFound":50000.
The only problem is "a" has more columns than "b". So the multiple collections request only returns the values of a.
Is it possible to query multiple collections with different fields? Or do they have to be the same? And how should I change my third url to get this result?
What you need is what I call a unification core. The core itself holds no content; it only serves as a wrapper that unifies the fields you want to display from both cores. In it you will need:
a schema.xml that wraps up all the fields that you want to have in your unified result
a query handler that combines the two different cores for you
An important restriction up front, taken from the Solr Wiki page about DistributedSearch:
Documents must have a unique key and the unique key must be stored (stored="true" in schema.xml) The unique key field must be unique across all shards. If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic.
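As a sketch, the restriction above corresponds to a schema fragment like the following. The id field mirrors the one used in the shard schemas in this answer; the uniqueKey declaration is an assumption, since the schemas shown here leave that line out.

```xml
<fields>
  <!-- The key field must be stored so distributed search can merge results by it -->
  <field name="id"
         type="int" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- Assumed declaration: id values must not collide between shard-1 and shard-2 -->
<uniqueKey>id</uniqueKey>
```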
As an example, I have shard-1 with the fields id, title, description and shard-2 with the fields id, title, abstractText. So I have these schemas:
schema of shard-1
<schema name="shard-1" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
schema of shard-2
<schema name="shard-2" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="abstractText"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
To unify these schemas I create a third schema that I call shard-unification, which contains all four fields.
<schema name="shard-unification" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="abstractText"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
Now I need to make use of this combined schema, so I create a query handler in the solrconfig.xml of the solr-unification core
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="qf">id title description abstractText</str>
<str name="fl">*,score</str>
<str name="mm">100%</str>
</lst>
</requestHandler>
<queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin" />
That's it. Now some index-data is required in shard-1 and shard-2. To query for a unified result, just query shard-unification with appropriate shards param.
http://localhost/solr/shard-unification/select?q=*:*&rows=100&start=0&wt=json&shards=localhost/solr/shard-1,localhost/solr/shard-2
This will return you a result like
{
"responseHeader":{
"status":0,
"QTime":10},
"response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
{
"id":1,
"title":"title 1",
"description":"description 1",
"score":1.0},
{
"id":2,
"title":"title 2",
"abstractText":"abstract 2",
"score":1.0}]
}}
Fetch the origin shard of a document
If you want each document to report the shard it came from, specify [shard] within fl, either as a parameter with the query or within the request handler's defaults (see below). The brackets are mandatory and will also appear in the response.
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="qf">id title description abstractText</str>
<str name="fl">*,score,[shard]</str>
<str name="mm">100%</str>
</lst>
</requestHandler>
<queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin" />
Working Sample
If you want to see a running example, check out my solrsample project on GitHub and execute the ShardUnificationTest. It now also includes the shard fetching.
Shards should be used in Solr
"When an index becomes too large to fit on a single system, or when a single query takes too long to execute"
so the number and names of the columns should always be the same. This is specified in this document (which is also where the quote above comes from):
http://wiki.apache.org/solr/DistributedSearch
If you leave your query as it is and give the two shards the same fields, this should just work as expected.
If you want more info about how shards work in SolrCloud, have a look at this document as well:
http://wiki.apache.org/solr/SolrCloud

Solr schema field

I've made a schema for solr and I don't know the name of every field from the document I want to add, so I defined a dynamicField like this:
<dynamicField name="*" type="text_general" indexed="true" stored="true" />
Right now I'm testing and I don't get an error when importing for undefined fields in the document, but when I try to query for *:something (anything other than "*") I don't get any results back.
My question is how can I define a catch all field, is there any right way to do this? Or am I under the wrong impression that a query for *:something would normally search in all the documents and all the fields for "something"?
The search keyword `*:something` cannot return anything from Solr, no matter what kind of field you are using, dynamicField or not.
If I understand your question correctly, you want a dynamicField to store all fields and want to query all of them later.
Here is my solution.
First, defining a default_search field for search:
<field name="default_search" type="text" indexed="true" stored="true" multiValued="true"/>
And then copy all fields into the default_search field.
<copyField source="*" dest="default_search" />
Finally, you can make a query for all fields like this:
http://host/core/select/?q=something
or
http://host/core/select/?q=default_search:something
AFAIK *:something does not query all the fields. It looks for a field named *.
I get the below error when attempting to do a query for *:test
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">9</int>
<lst name="params">
<str name="wt">xml</str>
<str name="q">*:test</str>
</lst>
</lst>
<lst name="error">
<str name="msg">undefined field *</str>
<int name="code">400</int>
</lst>
</response>
You would need to define a catchall field using copyField in your schema.xml.
I would recommend not using a simple wildcard for dynamic fields. Instead something like this:
<dynamicField name="*_text" type="text_general" indexed="true" stored="true" />
and then have a catchall field
<field name="CatchAll" type="text_general" indexed="true" stored="true" multiValued="true" />
You can have a copyField defined as below, to support query such as q=something
<copyField source="*_text" dest="CatchAll" />

solr spatial search with distance to search results

I'm able to return all results within a specific radius from geolocation point A, but I want to return the distance of each search result to point A.
I was reading this: http://wiki.apache.org/solr/SpatialSearch
I have this Solr query:
http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq={!geofilt}&pt=51.4416420,5.4697225&sfield=geolocation&d=20&sort=geodist()%20asc&q=*:*&start=0&rows=10&fl=_dist_:geodist(),id,title,lat,lng,geolocation,location&facet.mincount=1
And this in my schema.xml
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<field name="geolocation" type="location" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
This is one of the results:
<doc>
<str name="geolocation">51.4231086,5.474830699999984</str>
<str name="id">122</str>
<str name="lat">51.4231086</str>
<str name="lng">5.474830699999984</str>
<str name="title">Eindhoven Museum</str>
</doc>
However, with my current query string, I don't see a distance field in the document.
What am I missing?

Searching multiple fields in SOLR

I am trying to search on 2 fields without having to specify a field name in the query. In my schema.xml I have added 2 fields that correspond to 2 columns in a database table.
<field name="title" type="string" indexed="true" stored="true" required="true"/>
<field name="description" type="string" indexed="true" stored="true"/>
In addition I added a 3rd field which I want to use as a destination in "copyField"
and also as the "defaultSearchField"
<field name="combinedSearch" type="string" indexed="true" stored="true" multiValued="true"/>
<copyField source="*" dest="combinedSearch"/>
<uniqueKey>title</uniqueKey>
<defaultSearchField>combinedSearch</defaultSearchField>
Now in the Solr Admin UI, if I enter some title it will return results but if I enter some description it won't return anything.
It seems only the first field is used for searching. Am I using copyField and defaultSearchField in the right way?
I've restarted the solr server and regenerated the index.
Thanks.
It probably ends in the same result, but for your information, I use copyField at the end of the schema.xml (I don't think the order is relevant) with the following syntax:
<copyField source="title" dest="combinedSearch" />
<copyField source="description" dest="combinedSearch" />
next:
<field name="combinedSearch" type="string"
Whether type="text" is the better choice depends on how "string" is defined. If you are using the default fieldTypes, type="string" could be better for your case, because string fields are not analyzed by default, which (probably) means there is no tokenizing either.
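To make the difference concrete, here is a simplified sketch of the two kinds of field type, loosely modeled on Solr's example schema. The exact analyzer chain in your schema.xml may differ, so treat the filters below as assumptions.

```xml
<!-- Not analyzed: the whole field value is indexed as one single term -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Analyzed: the value is split into lowercased word tokens -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the string type, a query only matches when it equals the entire stored value, which would explain why searching for a single word from a longer description finds nothing.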
//update
Another way, instead of copyFields, is to use the (e)dismax query parser. In solrconfig.xml you can specify all the fields you would like to search by default, like this:
<requestHandler name="/select" class="solr.SearchHandler" default="true">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="defType">edismax</str>
<float name="tie">0.01</float>
<bool name="tv">true</bool>
<str name="qf">
title^1 description^1
</str>
...
Try changing your combinedSearch type to text and then regenerate the index.
Here's how I approached it. Instead of using the * wildcard, I defined which fields to copy to my combined field. I also set multiValued to false on my normal fields (title and description). Instead of defining my fields as string, I used "text_general", both for my normal fields and my combined field.
Furthermore, I set stored="false" on my combined field: I don't need to return its value, as it is only used for searching, in my case at least.
<field name="title" type="text_general" indexed="true" stored="true" required="true" multiValued="false" />
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="combinedSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="combinedSearch"/>
<copyField source="description" dest="combinedSearch"/>
<uniqueKey>title</uniqueKey>
<defaultSearchField>combinedSearch</defaultSearchField>
