My Search application leverages Solr in order to search on some wikis and forums content.
Sometimes vulgar words appear in posts and consequently they are indexed in Solr and appear in suggestions and searches as well.
Is there a way for Solr to ignore a set of predefined words considered vulgar?
The use case would be the following. We have:
A) a schema like:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="body" type="string" indexed="true" stored="true" />
B) a text file containing the vulgar words to ignore: words_to_ignore.txt. For instance it would contain:
badword1 badword2
C) A wiki having title "my wiki badword1" ;
If we ran the query:
http://localhost:8983/my_wiki_collection/select?q=name:(wiki+AND+badword1)
We would expect Solr to return the document:
<doc>
<str name="id">abcd-acdf-a1ga</str>
<str name="name">my wiky</str>
<str name="body">This is my amazing wiki</str>
</doc>
Just add them to your stopwords list.
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
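As a rough sketch of what that could look like for the schema in the question: the field type name text_clean below is made up for illustration, and the tokenizer and lowercase filter are just one plausible analyzer chain. It assumes words_to_ignore.txt sits in the core's conf directory and lists one word per line. Note that the fields in the question are of type string, which is not analyzed, so a stop filter would only take effect once they use an analyzed type such as this one.

```xml
<!-- schema.xml: hypothetical analyzed field type that drops the listed words -->
<fieldType name="text_clean" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- words_to_ignore.txt: one word per line, located in the conf directory -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="words_to_ignore.txt"/>
  </analyzer>
</fieldType>

<!-- the fields from the question, switched from string to the analyzed type -->
<field name="title" type="text_clean" indexed="true" stored="true" />
<field name="body" type="text_clean" indexed="true" stored="true" />
```

Keep in mind that a stop filter only removes the words from the indexed and queried terms. The stored value of the field is unchanged, so the original text is still returned with the document, and the content has to be reindexed after the change.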
I've followed the examples listed in the documentation here: http://wiki.apache.org/solr/Deduplication and https://cwiki.apache.org/confluence/display/solr/De-Duplication
However, when analyzing the results every signatureField gets returned like so:
0000000000000000
I can't seem to figure out why a unique signature isn't being generated.
Relevant config sections:
solrconfig.xml
<requestHandler name="/update"
class="solr.XmlUpdateRequestHandler">
<!-- See below for information on defining
updateRequestProcessorChains that can be used by name
on each Update Request
-->
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
...
<!-- Deduplication
An example dedup update processor that creates the "id" field
on the fly based on the hash code of some other fields. This
example has overwriteDupes set to false since we are using the
id field as the signatureField and Solr will maintain
uniqueness based on that anyway.
-->
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">signatureField</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
schema.xml
<fields>
<!-- Valid attributes for fields:
name: mandatory - the name for the field
type: mandatory - the name of a previously defined type from the
<types> section
indexed: true if this field should be indexed (searchable or sortable)
stored: true if this field should be retrievable
multiValued: true if this field may contain multiple values per document
omitNorms: (expert) set to true to omit the norms associated with
this field (this disables length normalization and index-time
boosting for the field, and saves some memory). Only full-text
fields or fields that need an index-time boost need norms.
Norms are omitted for primitive (non-analyzed) types by default.
termVectors: [false] set to true to store the term vector for a
given field.
When using MoreLikeThis, fields used for similarity should be
stored for best performance.
termPositions: Store position information with the term vector.
This will increase storage costs.
termOffsets: Store offset information with the term vector. This
will increase storage costs.
default: a value that should be used if no value is specified
when adding a document.
-->
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
... etc
I'm wondering if anyone can steer me in the right direction?
Given the following (single core) queries:
http://localhost/solr/a/select?indent=true&q=*:*&rows=100&start=0&wt=json
http://localhost/solr/b/select?indent=true&q=*:*&rows=100&start=0&wt=json
The first query returns "numFound":40000
The second query returns "numFound":10000
I tried putting these together by:
http://localhost/solr/a/select?indent=true&shards=localhost/solr/a,localhost/solr/b&q=*:*&rows=100&start=0&wt=json
Now I get "numFound":50000.
The only problem is that "a" has more columns than "b", so the multiple-collection request only returns the values of "a".
Is it possible to query multiple collections with different fields, or do they have to be the same? And how should I change my third URL to get this result?
What you need is what I call a unification core. That core itself has no content; it only serves as a wrapper that unifies the fields you want to display from both cores. In it you will need:
a schema.xml that wraps up all the fields that you want to have in your unified result
a query handler that combines the two different cores for you
One important restriction up front, taken from the Solr Wiki page about DistributedSearch:
Documents must have a unique key and the unique key must be stored (stored="true" in schema.xml) The unique key field must be unique across all shards. If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic.
As an example, I have shard-1 with the fields id, title, description and shard-2 with the fields id, title, abstractText. So I have these schemas:
schema of shard-1
<schema name="shard-1" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
schema of shard-2
<schema name="shard-2" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="abstractText"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
To unify these schemas I create a third schema that I call shard-unification, which contains all four fields.
<schema name="shard-unification" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="abstractText"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
Now I need to make use of this combined schema, so I create a query handler in the solrconfig.xml of the shard-unification core:
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="qf">id title description abstractText</str>
<str name="fl">*,score</str>
<str name="mm">100%</str>
</lst>
</requestHandler>
<queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin" />
That's it. Now some index data is required in shard-1 and shard-2. To query for a unified result, just query shard-unification with the appropriate shards param.
http://localhost/solr/shard-unification/select?q=*:*&rows=100&start=0&wt=json&shards=localhost/solr/shard-1,localhost/solr/shard-2
This will return a result like:
{
"responseHeader":{
"status":0,
"QTime":10},
"response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
{
"id":1,
"title":"title 1",
"description":"description 1",
"score":1.0},
{
"id":2,
"title":"title 2",
"abstractText":"abstract 2",
"score":1.0}]
}}
Fetch the origin shard of a document
If you want to fetch the originating shard into each document, you just need to specify [shard] within fl, either as a parameter with the query or within the request handler's defaults (see below). The brackets are mandatory; they will also appear in the resulting response.
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="qf">id title description abstractText</str>
<str name="fl">*,score,[shard]</str>
<str name="mm">100%</str>
</lst>
</requestHandler>
<queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin" />
Working Sample
If you want to see a running example, check out my solrsample project on GitHub and execute the ShardUnificationTest. I have also included the shard fetching by now.
Shards should be used in Solr
"When an index becomes too large to fit on a single system, or when a single query takes too long to execute"
so the number and names of the columns should always be the same. This is specified in this document (where the previous quote also comes from):
http://wiki.apache.org/solr/DistributedSearch
If you leave your query as it is and give the two shards the same fields, this should just work as expected.
If you want more info about how shards work in SolrCloud, have a look at this document as well:
http://wiki.apache.org/solr/SolrCloud
I am trying to use Solr's CurrencyField. I am using the example Solr instance (apache-solr-4.0.0/example/solr/collection1) to test the CurrencyField. I have added a field to the schema.xml as follows:
<field name="money" type="currency" indexed="true" stored="false" required="true" multiValued="false" />
However, when posting the XML file:
<doc>
<field name="id">12344321</field>
<field name="text">4312341</field>
<field name="money">1.30,USD</field>
</doc>
I get the following error:
SEVERE: org.apache.solr.common.SolrException: [doc=4312341] missing required field: money
Why am I getting this error, and how can I fix it?
I am using Solr 4.0.0
Paige is correct: You are getting this error because CurrencyField is a PolyField.
The following document shows three fields: the field "money", and two special dynamic fields "__raw_amount" and "__currency".
<doc>
<field name="money">1.30,USD</field>
</doc>
A workaround that keeps the "money" field not stored is to define it as a dynamic field.
<dynamicField name="*_c" type="currency" indexed="true" stored="false" />
My guess is that Solr dynamically generates new stored fields for both the raw amount and the currency.
That said, this question is a great candidate for the mailing list.
After a lot of trial and error, I discovered a solution: the money field must have stored="true" in the schema.xml.
<field name="money" type="currency" indexed="true" stored="true" required="true" multiValued="false" />
I do not know why this works.
I am a novice with Solr and was trying the example that comes in the example folder of the Solr (3.6) package (apache-solr-3.6.0.tgz). I started the server and posted the sample XML files in example/exampledocs; after that I could search and Solr returned matches, so all was good.
Then I tried posting another XML file with more than 10,000 documents. I modified the example/solr/conf/schema.xml file to add the fields of my XML file, restarted the server, and posted my XML file. The statistics in the Solr admin panel (http://localhost:8983/solr/admin/stats.jsp) show numDocs : 10020, which means the documents were posted successfully.
But when I search for anything that is present in my posted documents (from the 10,000-document XML file), it returns 0 results. Solr is still able to return results for searches that match content in the documents that come by default in the example/exampledocs folder. I am clueless about what has happened here; the value of numDocs clearly suggests that the documents I posted in the XML file were indexed.
Is there anything else I can inspect to see what's wrong with this?
The schema which comes in the example with the Solr package is like this
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
and more....
The XML file which I posted had some fields in common with the above schema, like title, description, price, etc., so I entered the rest of the fields in schema.xml like this:
<field name="cid" type="int" indexed="false" stored="false"/>
<field name="discount" type="float" indexed="true" stored="true"/>
<field name="link" type="string" indexed="true" stored="true"/>
<field name="status" type="string" indexed="true" stored="true"/>
<field name="pubDate" type="string" indexed="true" stored="true"/>
<field name="image" type="string" indexed="false" stored="false"/>
If you are using the default settings from the Solr example site, then the df setting in the solrconfig.xml file for the /select request handler sets the default search field to the text field.
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
....
</requestHandler>
If you look in the schema.xml file just below the field definitions you will see the multiple copyField settings that are moving the values from certain fields into the text field and therefore making them searchable via the default field setting. In your example of searching for Sony in the title field, if you look at the copyField statements, you will see that the title field is not being copied to the text default search field. Therefore, the documents with the Sony title value are not being returned in your query.
I would suggest the following:
Try a query by specifying the following: title:Sony that should return what you are expecting.
If you want the title field to be included in the default query field, then add the following copyField statement to the schema.xml file and reload your 10000 document file.
<copyField source="title" dest="text"/>
I hope this helps.
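A third option, sketched here under the assumption that you would rather search title by default than copy it into text, is to point df at title in the handler defaults. This is only an illustration based on the handler shown above; the other defaults are left as they were.

```xml
<!-- solrconfig.xml: same /select handler, with the default search field changed -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <!-- assumption: unqualified query terms should be matched against title -->
    <str name="df">title</str>
  </lst>
</requestHandler>
```

Since these are only defaults, the same effect can be tried for a single request by passing df=title as a query parameter, without touching the config. Changing df does not require reindexing, whereas adding a copyField does.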
I am trying to search on 2 fields without having to specify a field name in the query. In my schema.xml I have added 2 fields that correspond to 2 columns in a database table.
<field name="title" type="string" indexed="true" stored="true" required="true"/>
<field name="description" type="string" indexed="true" stored="true"/>
In addition I added a 3rd field which I want to use as a destination in "copyField"
and also as the "defaultSearchField"
<field name="combinedSearch" type="string" indexed="true" stored="true" multiValued="true"/>
<copyField source="*" dest="combinedSearch"/>
<uniqueKey>title</uniqueKey>
<defaultSearchField>combinedSearch</defaultSearchField>
Now in the Solr Admin UI, if I enter some title it will return results but if I enter some description it won't return anything.
It seems only the first field is used for searching. Am I using copyField and defaultSearchField in the right way?
I've restarted the solr server and regenerated the index.
Thanks.
It probably ends in the same result, but for your information, I put copyField at the end of the schema.xml (though I don't think the order is relevant), using the following syntax:
<copyField source="title" dest="combinedSearch" />
<copyField source="description" dest="combinedSearch" />
next:
<field name="combinedSearch" type="string"
Whether type="text" is the better choice depends on the definition of "string". If you are using the default fieldTypes, type="text" could be better for your case, because by default a string field is not analyzed, which (probably) means it is not tokenized either.
//update
Another way, instead of copyFields, is to use the (e)dismax query parser. In solrconfig.xml you can specify all the fields you would like to search by default, like this:
<requestHandler name="/select" class="solr.SearchHandler" default="true">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="defType">edismax</str>
<float name="tie">0.01</float>
<bool name="tv">true</bool>
<str name="qf">
title^1 description^1
</str>
...
Try changing your combinedSearch type to text and then regenerate the index.
Here's how I approached it. Instead of using the * alias, I defined which fields to copy to my combined field. I also set multiValued to false on my normal fields (title and description). Instead of defining my fields as string, I used "text_general", both for my normal fields and my combined field.
Furthermore, I set stored="false" on my combined field, since I don't need to return its value; it is only used for searching, in my case at least.
<field name="title" type="text_general" indexed="true" stored="true" required="true" multiValued="false" />
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="combinedSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="combinedSearch"/>
<copyField source="description" dest="combinedSearch"/>
<uniqueKey>title</uniqueKey>
<defaultSearchField>combinedSearch</defaultSearchField>