Spring Data Solr geoqueries - solr

I've just started to play around with Solr and managed to get it running within a Tomcat servlet container. I would now like to use the repository approach from Spring Data, but I got stuck when trying to handle lat/lon fields (i.e. geospatial data). I would like to store some tweet-like data. This is the schema I am currently using (trying to follow the wiki):
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="tweets" version="1.1">
<types>
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text1" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HunspellStemFilterFactory"
dictionary="../../dictionaries/es_ANY.dic"
affix="../../dictionaries/es_ANY.aff"
ignoreCase="true" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text2" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<fieldType name="date" class="solr.DateField"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="username" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="pictureURL" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="topic" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="content" type="text1" indexed="true" stored="true"/>
<field name="hashtags" type="text2" indexed="true" stored="true"/>
<field name="geo" type="location" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>id</defaultSearchField>
</schema>
This would work fine without the geo field, which I don't know how to map in my POJO (I tried both using double[] like MongoDB and String in geo field without much success):
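import java.util.Date;
import java.util.List;
import org.apache.solr.client.solrj.beans.Field;
import org.springframework.data.annotation.Id;
public class Tweet {
    @Id
    @Field
    private String id;
    @Field
    private String username;
    @Field
    private String pictureURL;
    @Field
    private String topic;
    @Field
    private String content;
    @Field
    private List<String> hashtags;
    // Mapped as a plain "[lat],[lng]" string here; this is the field that fails
    @Field
    private String geo;
    @Field
    private Date timestamp;
    /** Getters/setters omitted **/
}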
When mapping the geo field as a simple String ([lat],[lng]) the exception thrown is:
org.springframework.data.solr.UncategorizedSolrException: undefined field: "geo_0_coordinate"; nested exception is org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: undefined field: "geo_0_coordinate"
I tried having a look at the project tests but did not find any POJO using geo fields.
Any idea on how to proceed?
Thanks!

I finally found a solution. First of all, the geo field should be a GeoLocation:
@Field
private GeoLocation geo;
Another change required takes place in the schema.xml file:
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<fieldType name="double" class="solr.DoubleField"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<!-- ... -->
<field name="geo" type="location" indexed="true" stored="true"/>
<field name="geo_0_coordinate" type="double" indexed="true" stored="true" />
<field name="geo_1_coordinate" type="double" indexed="true" stored="true" />
It turns out that Solr stores a LatLonType value internally as a pair of doubles, and the sub-fields holding them must also be defined in the schema.
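To make the naming in the exception easier to follow, here is a minimal sketch of how the two sub-field names are derived from the field name and the subFieldSuffix. The class and method names are hypothetical helpers written for illustration; they are not part of Solr or Spring Data Solr.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Solr or Spring Data Solr class.
final class LatLonSubFields {
    private LatLonSubFields() {}

    // A LatLonType field is split into two double sub-fields named
    // <field>_0<suffix> (latitude) and <field>_1<suffix> (longitude).
    static List<String> subFieldNames(String field, String subFieldSuffix) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            names.add(field + "_" + i + subFieldSuffix);
        }
        return names;
    }

    // The "lat,lon" string form accepted by a LatLonType field.
    static String toLatLon(double lat, double lon) {
        return lat + "," + lon;
    }

    public static void main(String[] args) {
        // prints [geo_0_coordinate, geo_1_coordinate]
        System.out.println(subFieldNames("geo", "_coordinate"));
        // Example coordinates are made up.
        System.out.println(toLatLon(40.5, -3.25));
    }
}
```

With the schema above, "geo" and "_coordinate" give exactly the geo_0_coordinate name reported in the exception, which is why those fields have to resolve to a defined type.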
Hope this helps someone else!

Related

Solr combining exact match and likely match on single text field not working

I am trying to perform a partial ("likely") match on full-name and an exact match on the office-no, mobile-number, house-no and other-phone-number fields. I have copied all of these into a single text field, "full-search-all", so that the website can offer one text box: searching for part of a name such as Kat should return Katric, and entering an exact mobile number such as 123456789 in the same box should return only the exact match. When I search on "full-search-all" in the Solr admin, only one of the two behaviours works (either the exact match on the mobile, office and house numbers, or the partial match on full-name), never both. I am using the Standard Query Parser.
Below is the schema.xml file I created for this search.
Can you point out what is wrong in it? Is it not possible to have both kinds of search on a single text field?
Complete schema.xml file below
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="dynamic" version="1.5">
<types>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="search" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="exactstring" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<fieldType name="long" class="solr.TrieLongField" />
</types>
<fields>
<!-- The _version_ field is required when using the Solr update log or SolrCloud (cfr. SOLR-3432) -->
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="full-search-all" type="search" indexed="true" stored="false" multiValued="true" />
<field name="phone-number" type="exactstring" indexed="true" stored="false" multiValued="true" />
<!-- Exact Match columns -->
<copyField source="mobile-number" dest="phone-number" />
<copyField source="house-no" dest="phone-number" />
<copyField source="office-no" dest="phone-number" />
<copyField source="other-phone-number" dest="phone-number" />
<copyField source="mobile-number" dest="full-search-all" />
<copyField source="house-no" dest="full-search-all" />
<copyField source="office-no" dest="full-search-all" />
<copyField source="other-phone-number" dest="full-search-all" />
<copyField source="full-name" dest="full-search-all" />
<!-- query fields -->
<field name="application-id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="full-name" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
<field name="mobile-number" type="exactstring" indexed="true" stored="true" required="false" multiValued="false" />
<field name="house-no" type="exactstring" indexed="true" stored="true" required="false" multiValued="false" />
<field name="office-no" type="exactstring" indexed="true" stored="true" required="false" multiValued="false" />
<field name="other-phone-number" type="exactstring" indexed="true" stored="true" required="false" multiValued="false" />
<field name="campaign-name" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="reason" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
</fields>
<uniqueKey>application-id</uniqueKey>
</schema>
Field names should consist only of alphanumeric or underscore characters and should not start with a digit. From the Solr documentation:
The name of the field. Field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed. Names with both leading and trailing underscores (e.g. _version_) are reserved. Every field must have a name.
Most of your field names contain the - character; replace it (for example, full-name becomes full_name).
Source : https://cwiki.apache.org/confluence/display/solr/Defining+Fields
Once you have copied the fields into the full_search_all field, you can no longer tell them apart within that field. So if you want prefix matching on the name and exact matching on the phone number, you can't do it with a single field.
Instead, write a query analyzer in your application that decides which field to search.
For example, if a user types 123456789 (digits only) into the text box, your query analyzer should return phone_number as the field to search.
The query will be:
phone_number:123456789
And if a user types ashraful (non-numeric) into the text box, your query analyzer should return full_name.
The query will be:
full_name:ashraful
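The routing rule described above could be sketched like this. It is only an illustration under the assumption that the fields have been renamed to phone_number and full_name; the class and method names are hypothetical, and real code should also escape the user input (for example with SolrJ's ClientUtils.escapeQueryChars) before building the query.

```java
import java.util.regex.Pattern;

// Hypothetical helper: chooses the Solr field from the shape of the user input.
final class SearchFieldRouter {
    private static final Pattern DIGITS_ONLY = Pattern.compile("\\d+");

    private SearchFieldRouter() {}

    // Digits only -> exact match on phone_number; anything else -> full_name.
    // Note: the input is not escaped here, to keep the sketch short.
    static String buildQuery(String userInput) {
        String term = userInput.trim();
        if (DIGITS_ONLY.matcher(term).matches()) {
            return "phone_number:" + term;
        }
        return "full_name:" + term;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("123456789")); // prints phone_number:123456789
        System.out.println(buildQuery("ashraful"));  // prints full_name:ashraful
    }
}
```

Keeping this decision in the application means each field keeps its own analyzer (n-grams for the name, keyword for the numbers) instead of forcing one analyzer on a combined field.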

Apache Solr Facet Search with Space

I am new to Solr facet search. I am searching some data using Apache Solr and facet on some columns to get counts, but if a field value contains a space or a special character, each part is counted separately. I tried the solution from the linked question "Apache Solr facet search exclude space" to avoid splitting on spaces, but the problem persists.
My Schema.XML file, altered after reading that question, is:
<schema name="solr_quickstart" version="1.1">
<types>
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_not_tokenized" class="solr.TextField">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="int" class="solr.TrieIntField"/>
<fieldType name="UUIDField" class="solr.UUIDField"/>
</types>
<fields>
<field name="id" type="UUIDField" indexed="true" stored="true"/>
<field name="caseid" type="int" indexed="true" stored="true"/>
<field name="casenumber" type="text" indexed="true" stored="true"/>
<field name="casestatus" type="text" indexed="true" stored="true"/>
<field name="casetype" type="text" indexed="true" stored="true"/>
<field name="closeddate" type="text" indexed="true" stored="true"/>
<field name="courtname" type="text" indexed="true" stored="true"/>
<field name="courtabbr" type="text" indexed="true" stored="true"/>
<field name="fileddate" type="text" indexed="true" stored="true"/>
<field name="judgename" type="text" indexed="true" stored="true"/>
<field name="lastupdated" type="text" indexed="true" stored="true"/>
<field name="maindefendant" type="text" indexed="true" stored="true"/>
<field name="mainplaintiff" type="string" indexed="true" stored="true"/>
<field name="all" type="string" docValues="true" indexed="true" stored="false" multiValued="true"/>
</fields>
<defaultSearchField>casenumber</defaultSearchField>
<uniqueKey>id</uniqueKey>
<copyField source="casenumber" dest="all"/>
<copyField source="casestatus" dest="all"/>
<copyField source="casetype" dest="all"/>
<copyField source="courtname" dest="all"/>
<copyField source="courtabbr" dest="all"/>
<copyField source="judgename" dest="all"/>
<copyField source="maindefendant" dest="all"/>
<copyField source="mainplaintiff" dest="all"/>
</schema>
Could anyone kindly guide me to the right way of configuring my Schema.XML file?
Your problem is the tokenizer.
It splits the field value into separate terms, and every term gets its own count in facet queries. To avoid this, you could remove the tokenizer (or use another tokenizer). The result is that the whole field value becomes one term. This is a problem if you have more than one "subject" in your text field.
I had a similar problem and tried to use protected words, which are not applied to the tokenizer. They are more (only?) for stemming: see "solr not tokenizing protected words".
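The effect of the tokenizer on facet counts can be illustrated without Solr. The following is a simplified sketch, not Solr's implementation: it lower-cases and splits on whitespace to imitate a tokenized field, and keeps the whole value as a single term to imitate an untokenized one. The class name and the sample court names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Simplified illustration of facet term counting; not how Solr is implemented.
final class FacetTermDemo {
    private FacetTermDemo() {}

    // tokenized = true imitates a lower-casing, tokenizing analyzer;
    // tokenized = false keeps each value as one term, like a string field.
    static Map<String, Integer> facetCounts(List<String> values, boolean tokenized) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) {
            String[] terms = tokenized
                    ? value.trim().toLowerCase(Locale.ROOT).split("\\s+")
                    : new String[] { value };
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> courts = Arrays.asList("District Court", "District Court", "Supreme Court");
        System.out.println(facetCounts(courts, true));  // prints {court=3, district=2, supreme=1}
        System.out.println(facetCounts(courts, false)); // prints {District Court=2, Supreme Court=1}
    }
}
```

In the schema above, the faceted fields use the tokenized "text" type, which corresponds to the first output; facets on whole values correspond to the second.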

Solr RELOAD changes/reverts schema changes

Steps I did:
curl -u cassandra "http://localhost:8983/solr/admin/cores?action=CREATE&name=tweets.tweets_test&generateResources=true&reindex=true&deleteAll=true"
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
<fieldType class="org.apache.solr.schema.TrieLongField" name="TrieLongField"/>
</types>
<fields>
<field indexed="true" multiValued="true" name="atnames" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="links" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="tweet_date" stored="true" type="TrieDateField"/>
<field indexed="true" multiValued="false" name="tweet" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="hashtags" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="uid" stored="true" type="TrieLongField"/>
<field indexed="true" multiValued="false" name="tweet_id" stored="true" type="TrieLongField"/>
</fields>
<uniqueKey>(uid,tweet_id)</uniqueKey>
</schema>
I want to change the schema to the following (I want to index URLs using KeywordTokenizerFactory):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TextField" name="TextFieldURL">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
<fieldType class="org.apache.solr.schema.TrieLongField" name="TrieLongField"/>
</types>
<fields>
<field indexed="true" multiValued="true" name="atnames" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="links" stored="true" type="TextFieldURL"/>
<field indexed="true" multiValued="false" name="tweet_date" stored="true" type="TrieDateField"/>
<field indexed="true" multiValued="false" name="tweet" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="hashtags" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="uid" stored="true" type="TrieLongField"/>
<field indexed="true" multiValued="false" name="tweet_id" stored="true" type="TrieLongField"/>
</fields>
<uniqueKey>(uid,tweet_id)</uniqueKey>
</schema>
Let's upload changes:
curl "http://localhost:8983/solr/resource/tweets.tweets_test/schema.xml" --data-binary @tweets.tweets_test.xml -H 'Content-type:text/xml; charset=utf-8'
Get the latest schema back to make sure it uploaded successfully:
http://localhost:8983/solr/tweets.tweets_test/admin/file?file=schema.xml&contentType=text/xml;charset=utf-8
Looks good - I see my changes. (Btw, the changes that I did do not work, the links are still being indexed like so: "t.co", "http", ... ; probably another discussion) So I try to reload:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=tweets.tweets_test&reindex=true&deleteAll=true"
Get the latest schema back:
http://localhost:8983/solr/tweets.tweets_test/admin/file?file=schema.xml&contentType=text/xml;charset=utf-8
I don't see any of the changes that I uploaded; somehow schema.xml is back to the original.
Ideas?
Update: bug was solved in 4.6.6 and 4.7.0 -- DSP-5204
http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/RNdse46.html?scroll=RNdse46__rel466
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/RNdse.html?scroll=RNdse__470ResIss

SOLR performance

I am using SolrJ + Solr in my project.
The problem is that I have hit an unclear bottleneck in Solr/Jetty.
Using jvisualvm, I connected to the JVM instance in which Solr runs and saw that 77% of the time is spent in the method "org.eclipse.jetty.io.ByteArrayBuffer.readFrom()". The stack trace of one of the threads is below:
"qtp64700533-36718" - Thread t@36718
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1040)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
So it may look OK that the time is spent on I/O, but:
the application issuing the queries runs on the same machine (so I/O time should not be large, and the thread state "RUNNABLE" in the stack trace above seems suspicious)
query response times can reach 5-10 seconds
the load average on the machine (CentOS) is about 10
Any help or advice is appreciated, thanks!
UPD:
Indeed, guys, I forgot to give additional info. Here it is:
hardware: i3770, 32 GB RAM; iotop shows 50-600kb/sec read and 200-1000kb/sec write (almost all of it from the Solr process)
OS: CentOS 6.6
java: OpenJDK 64-Bit Server VM (1.7.0_71 24.65-b04)
solr: 4.9.0 (launched with -Xmx=24000, but I think I should split the Solr cores across separate Solr JVM instances to minimize GC time)
solrj: 4.10.3; adding/updating/removing documents is done with commitWithin=10000 msec in the Java code.
about schemas: I am storing data in Solr (ads + objects) for 5 countries: UA, RU, PL, BY, KZ.
So there are 2 cores for each country, for example for Ukraine: ua_ads and ua_objects (10 cores in total).
The schemas are almost identical between countries; see below for Ukraine.
"ua_ads" schema (I should rename it from "example" though :) )
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<uniqueKey>adId</uniqueKey>
<field name="adId" type="long" indexed="true" stored="true" required="true"/>
<field name="objectId" type="long" indexed="true" stored="true" required="false"/>
<field name="url" type="string" indexed="false" stored="true" required="true"/>
<field name="regionId" type="int" indexed="false" stored="true" required="true"/>
<field name="sourceId" type="int" indexed="false" stored="true" required="true"/>
<field name="type" type="int" indexed="false" stored="true" required="true"/>
<field name="title" type="text_ru" indexed="false" stored="true" required="true"/>
<field name="address" type="text_ru" indexed="false" stored="true" required="true"/>
<field name="text" type="text_ru" indexed="false" stored="true" required="true"/>
<field name="dateFound" type="tdate" indexed="true" stored="true" required="true"/>
<!-- should be a string field (not int) to avoid cutting zero at beginning of phone number -->
<field name="phoneNumbers" type="string" indexed="true" stored="true" required="true" multiValued="true"/>
<field name="priceLocal" type="long" indexed="false" stored="true" required="false"/>
<field name="priceUsd" type="long" indexed="false" stored="true" required="false"/>
<field name="currency" type="int" indexed="false" stored="true" required="false"/>
<field name="roomsCount" type="int" indexed="false" stored="true" required="false"/>
<field name="area" type="int" indexed="false" stored="true" required="false"/>
<field name="imagesCount" type="int" indexed="true" stored="true" required="true"/>
</schema>
"ua_objects" schema
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldtype name="binary" class="solr.BinaryField"/>
<fieldType name="addr_ru" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<!-- no stemming for address, dots must me followed by space: "г. Киев" -->
<!-- char filters is always firs (preprocessing) -->
<charFilter class="solr.MappingCharFilterFactory" mapping="lang/chars_replacement.txt" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- replacing all except letters, removing "-" in home address (9-А) -->
<filter class="solr.PatternReplaceFilterFactory" pattern="[^0-9abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюяіїє\-]" replacement="" replace="all"/>
<!-- replacing all except letters, removing "-" in home address ("9-а" => "9а") -->
<filter class="solr.PatternReplaceFilterFactory" pattern="(\d{1,3})[\- ]([абвгдеёжзийклмнопрстуфхцчшщ])" replacement="$1$2" replace="all"/>
<filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="lang/cities_ukr2rus.txt"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="ї" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="і" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="й" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="ё" replacement="е" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="є" replacement="е" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="э" replacement="е" replace="all"/>
<!-- 1-length is for case with home letters: "Хрещатик, 3" -->
<filter class="solr.LengthFilterFactory" min="1" max="64"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt,lang/stopwords_addr.txt" format="snowball"/>
</analyzer>
</fieldType>
<fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<!-- dots must me followed by space: "г. Киев" -->
<!-- char filters is always firs (preprocessing) -->
<charFilter class="solr.MappingCharFilterFactory" mapping="lang/chars_replacement.txt" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="[^0-9abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюяіїє\-]" replacement="" replace="all"/>
<!-- replacing all except letters, removing "-" in home address ("9-а" => "9а") -->
<filter class="solr.PatternReplaceFilterFactory" pattern="(\d{1,3})[\- ]([абвгдеёжзийклмнопрстуфхцчшщ])" replacement="$1$2" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="ї" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="і" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="й" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="ё" replacement="е" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="є" replacement="е" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="э" replacement="е" replace="all"/>
<filter class="solr.LengthFilterFactory" min="1" max="64"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball"/>
<filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="lang/synonyms.txt"/>
<filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
</analyzer>
</fieldType>
<field name="_version_" type="long" indexed="true" stored="true"/>
<uniqueKey>objectId</uniqueKey>
<field name="objectId" type="long" indexed="true" stored="true" required="true"/>
<field name="url" type="string" indexed="false" stored="true" required="true"/>
<field name="regionId" type="int" indexed="true" stored="true" required="true"/>
<field name="sourceId" type="int" indexed="false" stored="true" required="true"/>
<field name="type" type="int" indexed="true" stored="true" required="true"/>
<field name="address" type="addr_ru" indexed="true" stored="true" required="true"/>
<field name="title" type="text_ru" indexed="true" stored="true" required="true"/>
<field name="text" type="text_ru" indexed="true" stored="true" required="true"/>
<field name="dateFound" type="tdate" indexed="true" stored="true" required="true"/>
<!-- should be a string field (not int) to avoid cutting zero at beginning of phone number -->
<field name="phoneNumbers" type="string" indexed="true" stored="true" required="true" multiValued="true"/>
<field name="ownerDetected" type="boolean" indexed="true" stored="true" required="true"/>
<field name="priceUsd" type="long" indexed="true" stored="true" required="false"/>
<field name="priceLocal" type="long" indexed="false" stored="true" required="false"/>
<field name="currency" type="int" indexed="false" stored="true" required="false"/>
<field name="roomsCount" type="int" indexed="true" stored="true" required="false"/>
<field name="area" type="int" indexed="true" stored="true" required="false"/>
<field name="dateUpdated" type="tdate" indexed="true" stored="true" required="true"/>
<field name="dateClosed" type="tdate" indexed="true" stored="true" required="false"/>
<field name="m2priceRel" type="float" indexed="true" stored="true" required="false"/>
<field name="ceddData" type="binary" indexed="false" stored="true" required="false" multiValued="true"/>
<field name="imagesCount" type="int" indexed="true" stored="true" required="true"/>
<field name="uniqAdTexts" type="string" indexed="false" stored="true" required="true" multiValued="true"/>
</schema>
biggest indexes:
ru_ads: 2.99gb
ru_objects: 3.25gb
ua_ads: 5.45gb
ua_objects: 2.36gb
the indexes of the other cores are relatively small
The queries that run too long ("too long" as seen from the client side) look like these (taken from the Solr log; "????" is just non-English letters):
400723188 [qtp64700533-40547] INFO org.apache.solr.core.SolrCore ? [ua-objects] webapp=/solr path=/select params={mm=2&fl=*&start=0&q=(??????\+????????\+???????\+????????)+AND+type:3+AND+regionId:2+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[2+TO+2])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[40+TO+60])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[23500+TO+70500])+AND+dateUpdated:[2014-12-09T10:23:07Z+TO+2015-01-28T10:23:07Z]+AND+-objectId:(27824841)&qf=address^20+title^2&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=18 status=0 QTime=287
401989528 [qtp64700533-40830] INFO org.apache.solr.core.SolrCore ? [ru-objects] webapp=/solr path=/select params={mm=2&fl=*&start=0&q=(?????????????\+??????)+AND+type:4+AND+regionId:162+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[1+TO+1])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[40+TO+58])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[9+TO+27])+AND+dateUpdated:[2014-12-09T10:44:08Z+TO+2015-01-28T10:44:08Z]+AND+-objectId:(26415616)&qf=address^20+title^2&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=820 status=0 QTime=5755
400832723 [qtp64700533-40322] INFO org.apache.solr.core.SolrCore ? [ru-objects] webapp=/solr path=/select params={mm=2&fl=*&start=0&q=(????????\+???????)+AND+type:4+AND+regionId:102+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[1+TO+1])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[31+TO+45])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[115+TO+343])+AND+dateUpdated:[2014-12-09T10:24:57Z+TO+2015-01-28T10:24:57Z]+AND+-objectId:(26415342)&qf=address^20+title^2&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=9 status=0 QTime=372
402069370 [qtp64700533-40832] INFO org.apache.solr.core.SolrCore ? [ru-objects] webapp=/solr path=/select params={mm=1&fl=*&start=0&q=(????????\+?????????\+??\+????????)+AND+type:3+AND+regionId:135+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[1+TO+1])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[28+TO+40])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[9529+TO+28585])+AND+dateUpdated:[2014-10-30T10:45:33Z+TO+2015-01-28T10:45:33Z]+AND+-objectId:(26415855)&qf=address^20+title^2+text&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=14075 status=0 QTime=544
401805198 [qtp64700533-40233] INFO org.apache.solr.core.SolrCore ? [ua-objects] webapp=/solr path=/select params={mm=2&fl=*&start=0&q=(??????\+??\+??????\+?????\+??????????)+AND+type:3+AND+regionId:16+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[3+TO+3])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[93+TO+95])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[284050+TO+313950])+AND+dateUpdated:[2015-01-08T10:41:09Z+TO+2015-01-28T10:41:09Z]+AND+-objectId:(27826334)&qf=address^20+title^2&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=6 status=0 QTime=462
(screenshot: fresh profiling output from jvisualvm)
(screenshot: part of "top" output, delay=10 sec)
You have set the parameter rows=2147483647 in every one of your queries. The meaning of this parameter is (taken from the reference):
You can use the rows parameter to paginate results from a query. The
parameter specifies the maximum number of documents from the complete
result set that Solr should return to the client at one time.
The default value is 10. That is, by default, Solr returns 10
documents at a time in response to a query.
So you are in effect telling Solr to send all hits found for a query in a single response. This is the reason for your bad performance.
Does Google send you all 500,000,000 hits when you search for "java"? No. Why not? Performance. Every IR application I know gives you a small page with the first results so that a search performs well.
This is also the reason for your high I/O: Solr fetches the records from disk and writes them to the response. This is I/O, nothing more, nothing less.
Since you are using this for analytics and want to extract everything that matches, you should look into the new streaming export feature. Unfortunately, it is only available as of Solr 4.10.
You can also upgrade to an SSD; it gives Solr performance a very good boost.
Finally, review your cache settings. If you don't update frequently and some of the caches are full, you could increase them beyond the defaults. If you do update frequently, this is less beneficial, as caches are invalidated on commits.
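As a rough sketch of the paging idea from the first answer: instead of rows=2147483647, request a bounded page and move the start offset forward. The helper below only builds the start/rows parameter pairs; the class name and the page size of 100 are assumptions for illustration, and the 820 total is the hits value from one of the logged queries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds start/rows parameters for walking a result set page by page.
final class SolrPaging {
    private SolrPaging() {}

    static List<String> pageParams(long totalHits, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<String> params = new ArrayList<>();
        for (long start = 0; start < totalHits; start += pageSize) {
            params.add("start=" + start + "&rows=" + pageSize);
        }
        return params;
    }

    public static void main(String[] args) {
        List<String> pages = pageParams(820, 100);
        System.out.println(pages.size());                // prints 9
        System.out.println(pages.get(0));                // prints start=0&rows=100
        System.out.println(pages.get(pages.size() - 1)); // prints start=800&rows=100
    }
}
```

In most cases only the first page is ever requested, which is what keeps each response small.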

Unable to see data in a string field in apache solr 3.6 when importing data from mysql

All the other fields contain the imported data, but I don't see the company_logo field when I search for all results (i.e. *:*). Here is my data config file:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost:3306/demandfire"
user="root" password=""/>
<document>
<entity name="core4"
query="select company.company_name,demand.id,demand.issue,
demand.suggestion,demand.title,demand.company_id,company.logo from company,demand
where demand.company_id = company.id;
">
<field column="demand.id" name="id"/>
<field column="demand.issue" name="issue"/>
<field column="demand.suggestion" name="suggestion"/>
<field column="demand.title" name="title"/>
<field column="demand.company_id" name="company_id"/>
<field column="company.company_name" name="company_name"/>
<field column="company.logo" name="company_logo"/>
</entity>
</document>
</dataConfig>
The following is my schema file. The problem is with the company_logo field: I have mapped it correctly in the data-config file and all the other fields get data, but this field doesn't. A sample entry in the logo column of the MySQL table looks like '6bf38f4e-a9af-40b8-af04-2b90d3c93f1f.jpg'.
<schema name="example core one" version="1.1">
<types>
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="string_lowercase" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<!-- general -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="issue" type="string_lowercase" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="suggestion" type="string_lowercase" indexed = "true" stored="true" multivalued="false" required = "true"/>
<field name="title" type="string_lowercase" indexed = "true" stored="true" multivalued="false" required = "true"/>
<field name="company_id" type="string" indexed = "true" stored="true" multivalued="false" required = "true"/>
<field name="company_name" type="string_lowercase" indexed = "true" stored="true" multivalued="false" required = "true"/>
<field name="company_logo" type="string" indexed = "true" stored="true" multivalued="false" required = "false"/>
</fields>
<!-- field to use to determine and enforce document uniqueness. -->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>title</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
</schema>
Try removing the required="false" attribute from your field (false should be the default anyway).
That is, change the definition of your field to this:
<field name="company_logo" type="string" indexed = "true" stored="true" multivalued="false"/>
