Apache Solr Facet Search with Space

I am new to Solr facet search. I am searching some data with Apache Solr and faceting on a few columns to get counts, but when a field value contains a space or a special character, each part of the value is counted separately. I tried the solution from this link, Apache Solr facet search exclude space, to avoid splitting on spaces, but the problem persists.
After following that link, my altered schema.xml is:
<schema name="solr_quickstart" version="1.1">
<types>
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_not_tokenized" class="solr.TextField">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="int" class="solr.TrieIntField"/>
<fieldType name="UUIDField" class="solr.UUIDField"/>
</types>
<fields>
<field name="id" type="UUIDField" indexed="true" stored="true"/>
<field name="caseid" type="int" indexed="true" stored="true"/>
<field name="casenumber" type="text" indexed="true" stored="true"/>
<field name="casestatus" type="text" indexed="true" stored="true"/>
<field name="casetype" type="text" indexed="true" stored="true"/>
<field name="closeddate" type="text" indexed="true" stored="true"/>
<field name="courtname" type="text" indexed="true" stored="true"/>
<field name="courtabbr" type="text" indexed="true" stored="true"/>
<field name="fileddate" type="text" indexed="true" stored="true"/>
<field name="judgename" type="text" indexed="true" stored="true"/>
<field name="lastupdated" type="text" indexed="true" stored="true"/>
<field name="maindefendant" type="text" indexed="true" stored="true"/>
<field name="mainplaintiff" type="string" indexed="true" stored="true"/>
<field name="all" type="string" docValues="true" indexed="true" stored="false" multiValued="true"/>
</fields>
<defaultSearchField>casenumber</defaultSearchField>
<uniqueKey>id</uniqueKey>
<copyField source="casenumber" dest="all"/>
<copyField source="casestatus" dest="all"/>
<copyField source="casetype" dest="all"/>
<copyField source="courtname" dest="all"/>
<copyField source="courtabbr" dest="all"/>
<copyField source="judgename" dest="all"/>
<copyField source="maindefendant" dest="all"/>
<copyField source="mainplaintiff" dest="all"/>
</schema>
Could anyone guide me toward the right way to configure my schema.xml?

Your problem is the tokenizer.
It splits the field value into separate terms, and every term gets its own count in facet queries. To avoid this, you could drop the tokenizing analysis (or use a different tokenizer that does not split the value). The result is that the whole field value becomes one term. That is a problem if you have more than one "subject" in your text field.
I had a similar problem and tried to use protected words, but those are not applied by the tokenizer. They are more (only?) for stemming; see: solr not tokenizing protected words
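To make the suggestion concrete, here is a minimal sketch of a facet-only field. Note that in the question's schema the text_not_tokenized type is defined but no field uses it; the faceted columns are still of type text, which goes through StandardTokenizerFactory. The name courtname_facet is hypothetical (it is not in the original schema); courtname and the string type are taken from the question.

```xml
<!-- Sketch only: courtname_facet is a hypothetical field name. -->
<!-- Inside <fields>: an untokenized copy used only for faceting. -->
<field name="courtname_facet" type="string" indexed="true" stored="false"/>

<!-- At the <schema> level, next to the existing copyField rules. -->
<copyField source="courtname" dest="courtname_facet"/>
```

Faceting on the copy (facet.field=courtname_facet) should then count each whole value, spaces included, as a single term, while searches against courtname keep the tokenized analysis. Existing documents need to be reindexed after the schema change.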

Related

Access Denied trying to create Solr Config

I'm following the example at:
https://github.com/watson-developer-cloud/node-sdk/blob/master/examples/retrieve_and_rank_solr.v1.js
But every time I try to upload a config I get:
"Error: Unauthorized: Access is denied due to invalid credentials."
I've made an API key for Retrieve and Rank. Is there anything more I need to do to manage the credentials for R&R?
Here's my code:
return retrieveInstance.uploadConfigAsync({
cluster_id: clusterId,
config_name: watsonConfig.config_name,
config_zip_path: (__dirname + "/../../" + watsonConfig.config_path)
});
I'm successfully creating a cluster with this API key.
Schema.zip has this schema.xml
<schema name="simple" version="1.5">
<fields>
<!-- required -->
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="question" type="string" indexed="true" stored="true" required="true" />
<field name="answer" type="string" indexed="true" stored="true" required="true" />
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_ms" type="string" indexed="true" stored="true" multiValued="true" />
<dynamicField name="*_t" type="string" indexed="true" stored="true" />
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_mi" type="int" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
</types>
</schema>
Details on how to access the credentials can be found here: https://www.ibm.com/watson/developercloud/doc/retrieve-rank/tutorial.shtml#credentials
To sum up, from the Bluemix web dashboard, if you click on your R&R service instance, the "Service Credentials" tab will show a username and password. These will not be your IBM ID username or password.
That said, if you've been able to create a cluster, that would suggest that you have got valid credentials. Are you sure that the cluster was created successfully? Can you confirm this by getting the cluster details using the curl command described at https://www.ibm.com/watson/developercloud/retrieve-and-rank/api/v1/?curl#list_solr_clusters ?
I ran into the same problem. Take the cranfield-solr-config.zip from the tutorial and replace its original config files (schema.xml...) with your own, in place. Do not uncompress the zip file and compress it again! I do not know why this matters, but it does.

Solr RELOAD changes/reverts schema changes

Steps I took:
curl -u cassandra "http://localhost:8983/solr/admin/cores?action=CREATE&name=tweets.tweets_test&generateResources=true&reindex=true&deleteAll=true"
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
<fieldType class="org.apache.solr.schema.TrieLongField" name="TrieLongField"/>
</types>
<fields>
<field indexed="true" multiValued="true" name="atnames" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="links" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="tweet_date" stored="true" type="TrieDateField"/>
<field indexed="true" multiValued="false" name="tweet" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="hashtags" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="uid" stored="true" type="TrieLongField"/>
<field indexed="true" multiValued="false" name="tweet_id" stored="true" type="TrieLongField"/>
</fields>
<uniqueKey>(uid,tweet_id)</uniqueKey>
</schema>
I want to change the schema to the following, so that URLs are indexed using KeywordTokenizerFactory:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TextField" name="TextFieldURL">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
<fieldType class="org.apache.solr.schema.TrieLongField" name="TrieLongField"/>
</types>
<fields>
<field indexed="true" multiValued="true" name="atnames" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="links" stored="true" type="TextFieldURL"/>
<field indexed="true" multiValued="false" name="tweet_date" stored="true" type="TrieDateField"/>
<field indexed="true" multiValued="false" name="tweet" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="hashtags" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="uid" stored="true" type="TrieLongField"/>
<field indexed="true" multiValued="false" name="tweet_id" stored="true" type="TrieLongField"/>
</fields>
<uniqueKey>(uid,tweet_id)</uniqueKey>
</schema>
Let's upload changes:
curl "http://localhost:8983/solr/resource/tweets.tweets_test/schema.xml" --data-binary @tweets.tweets_test.xml -H 'Content-type:text/xml; charset=utf-8'
Get the latest schema back to make sure it uploaded successfully:
http://localhost:8983/solr/tweets.tweets_test/admin/file?file=schema.xml&contentType=text/xml;charset=utf-8
Looks good: I can see my changes. (By the way, the changes themselves do not work; the links are still being indexed as "t.co", "http", ...; that is probably a separate discussion.) So I try to reload:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=tweets.tweets_test&reindex=true&deleteAll=true"
Get the latest schema back:
http://localhost:8983/solr/tweets.tweets_test/admin/file?file=schema.xml&contentType=text/xml;charset=utf-8
I don't see any of the changes I uploaded; somehow schema.xml is back to the original.
Any ideas?
Update: the bug was fixed in DataStax Enterprise 4.6.6 and 4.7.0 (DSP-5204):
http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/RNdse46.html?scroll=RNdse46__rel466
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/RNdse.html?scroll=RNdse__470ResIss

Solr exception due to schema

I have the following solr schema
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="testthing" version="1.5">
<fields>
<field name="_version_" type="long" indexed="true" stored="true" required="true"/>
<field name="doc_id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="doc_type" type="string" indexed="false" stored="true" required="true" multiValued="false"/>
<field name="description" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="allText" type="fs_text" indexed="true" stored="false" required="true" multiValued="true"/>
</fields>
<uniqueKey>doc_id</uniqueKey>
<copyField source="title" dest="allText" />
<copyField source="description" dest="allText" />
<dynamicField name="*" type="ignored" multiValued="true" />
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="fs_text" class="solr.TextField" positionIncrementGap="100"/>
</types>
</schema>
Solr complains about an undefined field "text" when looking up the dynamic field type:
1898 [main] INFO org.apache.solr.servlet.SolrDispatchFilter ? SolrDispatchFilter.init() done
1918 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore ? org.apache.solr.common.SolrException: undefined field text at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1235)
However, my one and only dynamic field (ignore everything not matched) doesn't use a text type (it's type="ignored").
What am I missing here?
Update: so far, renaming allText to text has pretty much fixed the issue, but I can't figure out why. Is there something special or predefined about "text" in Solr 4.1?
It is not about a field type named "text". It is about a field named "text", referenced for example like this:
<defaultSearchField>text</defaultSearchField>
You may have changed or removed the default search field in your config. Since renaming the field fixes the issue, something in your configuration still refers to a field named "text", possibly in solrconfig.xml.
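One place such a reference commonly lives is the default-field (df) setting of a request handler in solrconfig.xml; the example configuration shipped with Solr 4.x sets it to "text". The fragment below is a sketch of what to look for, assuming your solrconfig.xml was derived from that example; your handler names and other defaults may differ, and other components in the file may reference the field as well.

```xml
<!-- solrconfig.xml (sketch; assumes a config derived from the Solr 4.x example) -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <!-- Default search field: the example config points this at "text".
         Point it at a field that exists in your schema, such as allText. -->
    <str name="df">allText</str>
  </lst>
</requestHandler>
```

Searching solrconfig.xml for the literal string text is usually the quickest way to find every remaining reference after renaming or removing the field.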

Spring Data Solr geoqueries

I've just started to play around with Solr a little and managed to get it running within a Tomcat servlet container. I would now like to use the repository approach from Spring Data, but I got stuck when trying to handle lat/lon fields (i.e. geospatial data). I would like to store some tweet-like data. This is the schema I am currently using (trying to follow the wiki):
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="tweets" version="1.1">
<types>
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text1" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HunspellStemFilterFactory"
dictionary="../../dictionaries/es_ANY.dic"
affix="../../dictionaries/es_ANY.aff"
ignoreCase="true" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text2" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<fieldType name="date" class="solr.DateField"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="username" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="pictureURL" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="topic" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="content" type="text1" indexed="true" stored="true"/>
<field name="hashtags" type="text2" indexed="true" stored="true"/>
<field name="geo" type="location" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>id</defaultSearchField>
</schema>
This would work fine without the geo field, which I don't know how to map in my POJO (I tried both using double[] like MongoDB and String in geo field without much success):
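import java.util.Date;
import java.util.List;

import org.apache.solr.client.solrj.beans.Field;
import org.springframework.data.annotation.Id;

public class Tweet {
    @Id
    @Field
    private String id;
    @Field
    private String username;
    @Field
    private String pictureURL;
    @Field
    private String topic;
    @Field
    private String content;
    @Field
    private List<String> hashtags;
    // Mapped as a plain "[lat],[lng]" String here; this is the mapping that fails.
    @Field
    private String geo;
    @Field
    private Date timestamp;
    /** Getters/setters omitted **/
}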
When mapping the geo field as a simple String ([lat],[lng]) the exception thrown is:
org.springframework.data.solr.UncategorizedSolrException: undefined field: "geo_0_coordinate"; nested exception is org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: undefined field: "geo_0_coordinate"
I tried having a look at the project tests but did not find any POJO using geo fields.
Any idea on how to proceed?
Thanks!
I finally found a solution. First of all, the geo field should be a GeoLocation:
@Field
private GeoLocation geo;
Another change required takes place in the schema.xml file:
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<fieldType name="double" class="solr.DoubleField"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<!-- ... -->
<field name="geo" type="location" indexed="true" stored="true"/>
<field name="geo_0_coordinate" type="double" indexed="true" stored="true" />
<field name="geo_1_coordinate" type="double" indexed="true" stored="true" />
It turns out Solr stores a LatLonType value internally as a pair of doubles in sub-fields, and those sub-fields must also be defined in the schema.
Hope this helps someone else!
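An alternative that may avoid declaring each sub-field explicitly: in the question's schema the *_coordinate dynamicField sits inside <types> and refers to a tdouble type that is never defined, so it plausibly was never registered, which would explain the undefined field "geo_0_coordinate" error. The sketch below assumes Solr 4.x conventions; the tdouble definition is copied from the stock example schema, not from the question.

```xml
<types>
  <!-- ... existing types ... -->
  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
  <!-- Assumed definition, as found in the Solr example schema. -->
  <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
</types>
<fields>
  <!-- ... existing fields ... -->
  <field name="geo" type="location" indexed="true" stored="true"/>
  <!-- Declared under <fields>; matches geo_0_coordinate and geo_1_coordinate. -->
  <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
</fields>
```

With this layout the two explicit geo_N_coordinate fields should no longer be needed, though it is worth confirming against your Solr version before removing them.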

SOLR 4.0 alphabetical sorting trouble

I'm having a hard time getting my head around an issue I have with my SOLR address database.
I built this one up from the example files. I'm basically running the example configuration with a modified schema.
schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true" required="false" multiValued="false" />
<field name="givenname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="middleinitial_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="surname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="gender_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="pictureuri_s" type="string" indexed="false" stored="true" required="false" multiValued="false" />
<field name="function_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunit_s" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunitdescription_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="company_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="street_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="streetnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="postcode_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="city_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="building_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="roomnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="country_s" type="text_en" indexed="true" stored="true" required="true" multiValued="false" />
<field name="countrycode_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="emailaddress_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone1_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone2_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="mobile_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="fax_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
I am populating the database by pushing about 20,000 random test records like the following to post.jar:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add>
<doc>
<field name="id">1352498443_1</field>
<field name="givenname_s">Aynur</field>
<field name="middleinitial_s"/>
<field name="surname_s">Lehnen</field>
<field name="gender_s">F</field>
<field name="pictureuri_s">dummy_assets/female.jpg</field>
<field name="function_s">Zugschaffner/in</field>
<field name="organizationalunit_s">P 07</field>
<field name="organizationalunitdescription_s">Lorem Ipsum sadipscing voluptua ipsum invidunt dolor et dolore invidunt sed consetetur accusam dolore Lorem tempor.</field>
<field name="company_s">Lorem Lagna Epsum Emet</field>
<field name="street_s">Erlenweg</field>
<field name="streetnumber_s">82</field>
<field name="postcode_s">76297</field>
<field name="city_s">Lübeck</field>
<field name="building_s"/>
<field name="roomnumber_s">242</field>
<field name="country_s">GERMANY</field>
<field name="countrycode_s">DE</field>
<field name="emailaddress_s">aynur.lehnen@lorem-lagna-epsum-emet.de</field>
<field name="phone1_s">0392984823</field>
<field name="phone2_s">0124111417</field>
<field name="mobile_s">0325117132</field>
<field name="fax_s">0171459177</field>
</doc>
</add>
However, when retrieving data I seem to have problems with alphabetical sorting. Consider the following query and its response:
{
"responseHeader": {
"status": 0,
"QTime": 5,
"params": {
"sort": "surname_s asc",
"fl": "surname_s",
"indent": "true",
"wt": "json",
"q": "city_s:berlin"
}
},
"response": {
"numFound": 1094,
"start": 0,
"docs": [{
"surname_s": "Weil"
}, {
"surname_s": "Abel"
}, {
"surname_s": "Adam"
}, {
"surname_s": "Ade"
}, {
"surname_s": "Adrian"
}, {
"surname_s": "Aigner"
}, {
"surname_s": "Aigner"
}, {
"surname_s": "Alber"
}, {
"surname_s": "Alber"
}, {
"surname_s": "Albers"
}]
}
}
Why is "Weil" on position one, while the rest of the data appears to be sorted correctly?
I believe that some of the additional analyzers applied in the text_de field type are the cause of this sorting behavior. In my experience, the best results when sorting strings come from using the alphaOnlySort fieldType that ships with the example schema.xml, shown below.
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<!-- KeywordTokenizer does no actual tokenizing, so the entire
input string is preserved as a single token
-->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- The LowerCase TokenFilter does what you expect, which can be
useful when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
<!-- The TrimFilter removes any leading or trailing whitespace -->
<filter class="solr.TrimFilterFactory" />
<!-- The PatternReplaceFilter gives you the flexibility to use
Java Regular expression to replace any sequence of characters
matching a pattern with an arbitrary replacement string,
which may include back references to portions of the original
string matched by the pattern.
See the Java Regular Expression documentation for more
information on pattern and replacement string syntax.
http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
-->
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])" replacement="" replace="all"
/>
</analyzer>
</fieldType>
I would recommend creating a new field and then copying the value from surname_s via copyField, something like the following:
<field name="surname_s_sort" type="alphaOnlySort" indexed="true" stored="false" required="false" multiValued="false" />
<copyField source="surname_s" dest="surname_s_sort"/>
Note: there is not any need to store the value in the surname_s_sort field, hence the stored="false" attribute, unless you expect to display that to the users.
Then you can just change your query to sort on the surname_s_sort instead.
Sorting doesn't work well on multivalued or tokenized fields.
From the documentation:
Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)
Use string as the field type and copy the surname_s field into the new field.
<field name="surname_s_sort" type="string" indexed="true" stored="false"/>
<copyField source="surname_s" dest="surname_s_sort" />
As @Paige answered, you can also use a keyword tokenizer with a lowercase filter, neither of which splits the field into multiple tokens.
I had similar issues and tried alphaOnlySort. It works in part, but it starts messing up the sort results when the field contains characters such as -, / or spaces, presumably because its PatternReplaceFilter strips everything that is not a-z before sorting.
So the result was something like:
/ abc
aa
/ abc2
So I ended up using the lowercase field type. It was already in the schema, so I assume it is one of the default types. I kept the copyField construction, so my final config was:
<schema>
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fields>
<field name="job_name_sort" type="lowercase" indexed="true" stored="false" required="false"/>
</fields>
<copyField source="job_name" dest="job_name_sort"/>
</schema>

Resources