Solr RELOAD changes/reverts schema changes - solr

Steps I took:
curl -u cassandra "http://localhost:8983/solr/admin/cores?action=CREATE&name=tweets.tweets_test&generateResources=true&reindex=true&deleteAll=true"
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
<fieldType class="org.apache.solr.schema.TrieLongField" name="TrieLongField"/>
</types>
<fields>
<field indexed="true" multiValued="true" name="atnames" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="links" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="tweet_date" stored="true" type="TrieDateField"/>
<field indexed="true" multiValued="false" name="tweet" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="hashtags" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="uid" stored="true" type="TrieLongField"/>
<field indexed="true" multiValued="false" name="tweet_id" stored="true" type="TrieLongField"/>
</fields>
<uniqueKey>(uid,tweet_id)</uniqueKey>
</schema>
I then changed the schema as follows (I want to index URLs using KeywordTokenizerFactory):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TextField" name="TextFieldURL">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
<fieldType class="org.apache.solr.schema.TrieLongField" name="TrieLongField"/>
</types>
<fields>
<field indexed="true" multiValued="true" name="atnames" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="links" stored="true" type="TextFieldURL"/>
<field indexed="true" multiValued="false" name="tweet_date" stored="true" type="TrieDateField"/>
<field indexed="true" multiValued="false" name="tweet" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="hashtags" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="uid" stored="true" type="TrieLongField"/>
<field indexed="true" multiValued="false" name="tweet_id" stored="true" type="TrieLongField"/>
</fields>
<uniqueKey>(uid,tweet_id)</uniqueKey>
</schema>
Let's upload the changes:
curl "http://localhost:8983/solr/resource/tweets.tweets_test/schema.xml" --data-binary @tweets.tweets_test.xml -H 'Content-type:text/xml; charset=utf-8'
Get the latest schema back to make sure it uploaded successfully:
http://localhost:8983/solr/tweets.tweets_test/admin/file?file=schema.xml&contentType=text/xml;charset=utf-8
Looks good: I can see my changes. (By the way, the changes do not have the intended effect; the links are still being indexed as separate tokens such as "t.co", "http", ... That is probably a separate discussion.) So I try to reload:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=tweets.tweets_test&reindex=true&deleteAll=true"
Get the latest schema back:
http://localhost:8983/solr/tweets.tweets_test/admin/file?file=schema.xml&contentType=text/xml;charset=utf-8
I don't see any of the changes I uploaded; somehow schema.xml is back to the original.
Ideas?

Update: this bug was fixed in DataStax Enterprise 4.6.6 and 4.7.0 (DSP-5204):
http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/RNdse46.html?scroll=RNdse46__rel466
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/RNdse.html?scroll=RNdse__470ResIss
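The check described in the question (fetch the schema after the upload and again after RELOAD, then look for the change) can be automated. Below is a minimal sketch in Python, assuming the two schema documents are already available as strings, for example the file that was uploaded and the body returned by the admin/file URL. The helper names `field_types` and `schema_diff` are hypothetical and are not part of Solr or DSE.

```python
import xml.etree.ElementTree as ET


def field_types(schema_xml):
    """Map each <field> name to its declared type in a schema.xml string."""
    root = ET.fromstring(schema_xml)
    return {f.get("name"): f.get("type") for f in root.iter("field")}


def schema_diff(expected_xml, served_xml):
    """Return {field: (expected_type, served_type)} for fields whose type differs."""
    expected = field_types(expected_xml)
    served = field_types(served_xml)
    names = sorted(set(expected) | set(served))
    return {
        name: (expected.get(name), served.get(name))
        for name in names
        if expected.get(name) != served.get(name)
    }


# Trimmed-down stand-ins for the schemas in the question (illustrative only).
UPLOADED = (
    '<schema name="autoSolrSchema" version="1.5"><fields>'
    '<field name="links" type="TextFieldURL"/>'
    '<field name="tweet" type="TextField"/>'
    "</fields></schema>"
)
# Simulates the reverted schema that was served after RELOAD.
SERVED_AFTER_RELOAD = UPLOADED.replace("TextFieldURL", "TextField")

print(schema_diff(UPLOADED, SERVED_AFTER_RELOAD))
```

An empty result means the served schema matches what was uploaded. A non-empty result after RELOAD, as in this simulated case, points at the revert described above.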

Related

Access Denied trying to create Solr Config

I'm following the example at:
https://github.com/watson-developer-cloud/node-sdk/blob/master/examples/retrieve_and_rank_solr.v1.js
But every time I try to upload a config I get:
"Error: Unauthorized: Access is denied due to invalid credentials."
I've made an API key for Retrieve and Rank. Is there anything else I need to do to manage the credentials for R&R?
Here's my code:
return retrieveInstance.uploadConfigAsync({
cluster_id: clusterId,
config_name: watsonConfig.config_name,
config_zip_path: (__dirname + "/../../" + watsonConfig.config_path)
});
I'm successfully creating a cluster with this API key.
Schema.zip has this schema.xml
<schema name="simple" version="1.5">
<fields>
<!-- required -->
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="question" type="string" indexed="true" stored="true" required="true" />
<field name="answer" type="string" indexed="true" stored="true" required="true" />
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_ms" type="string" indexed="true" stored="true" multiValued="true" />
<dynamicField name="*_t" type="string" indexed="true" stored="true" />
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_mi" type="int" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
</types>
</schema>
Details on how to access the credentials can be found here : https://www.ibm.com/watson/developercloud/doc/retrieve-rank/tutorial.shtml#credentials
To sum up, from the Bluemix web dashboard, if you click on your R&R service instance, the "Service Credentials" tab will show a username and password. These will not be your IBM ID username or password.
That said, if you've been able to create a cluster, that would suggest that you have got valid credentials. Are you sure that the cluster was created successfully? Can you confirm this by getting the cluster details using the curl command described at https://www.ibm.com/watson/developercloud/retrieve-and-rank/api/v1/?curl#list_solr_clusters ?
Dude, I ran into the same problem. Use the cranfield-solr-config.zip from the tutorial and replace its original config files (schema.xml, ...) with your own. But do not uncompress the zip file and compress it again! I do not know why this matters, but it does.

Apache Solr Facet Search with Space

I am new to Solr facet search. I am searching some data with Apache Solr and faceting on some columns to get counts, but if a field value contains a space or a special character, each part is counted separately. I tried the solution from the question "Apache Solr facet search exclude space" to avoid splitting on spaces, but my problem persists.
My altered schema.xml file, after following that link, is:
<schema name="solr_quickstart" version="1.1">
<types>
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_not_tokenized" class="solr.TextField">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="int" class="solr.TrieIntField"/>
<fieldType name="UUIDField" class="solr.UUIDField"/>
</types>
<fields>
<field name="id" type="UUIDField" indexed="true" stored="true"/>
<field name="caseid" type="int" indexed="true" stored="true"/>
<field name="casenumber" type="text" indexed="true" stored="true"/>
<field name="casestatus" type="text" indexed="true" stored="true"/>
<field name="casetype" type="text" indexed="true" stored="true"/>
<field name="closeddate" type="text" indexed="true" stored="true"/>
<field name="courtname" type="text" indexed="true" stored="true"/>
<field name="courtabbr" type="text" indexed="true" stored="true"/>
<field name="fileddate" type="text" indexed="true" stored="true"/>
<field name="judgename" type="text" indexed="true" stored="true"/>
<field name="lastupdated" type="text" indexed="true" stored="true"/>
<field name="maindefendant" type="text" indexed="true" stored="true"/>
<field name="mainplaintiff" type="string" indexed="true" stored="true"/>
<field name="all" type="string" docValues="true" indexed="true" stored="false" multiValued="true"/>
</fields>
<defaultSearchField>casenumber</defaultSearchField>
<uniqueKey>id</uniqueKey>
<copyField source="casenumber" dest="all"/>
<copyField source="casestatus" dest="all"/>
<copyField source="casetype" dest="all"/>
<copyField source="courtname" dest="all"/>
<copyField source="courtabbr" dest="all"/>
<copyField source="judgename" dest="all"/>
<copyField source="maindefendant" dest="all"/>
<copyField source="mainplaintiff" dest="all"/>
</schema>
Could anyone kindly guide me toward the right way of configuring my schema.xml file?
Your problem is the tokenizer.
It splits the field value into separate terms, and every term gets its own count in facet queries. To avoid this, you could remove the tokenizer (or use another tokenizer). The result is that the whole field value becomes one term. This is a problem if you have more than one "subject" in your text field.
I had a similar problem and tried to use protected words, which are not applied by the tokenizer. They are more (only?) for stemming; see: solr not tokenizing protected words
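The effect the answer describes can be reproduced outside Solr. The following is a minimal Python sketch, not Solr code: lowercasing and splitting on whitespace stands in for StandardTokenizerFactory plus LowerCaseFilterFactory, and keeping the value whole stands in for KeywordTokenizerFactory or a StrField. The function name `facet_counts` and the sample court names are made up for illustration.

```python
from collections import Counter


def facet_counts(values, tokenize):
    """Count documents per indexed term, the way a field facet would.

    tokenize=True  -> lowercase and split on whitespace (rough stand-in for
                      a tokenized text field)
    tokenize=False -> keep the whole value as a single term
    """
    counts = Counter()
    for value in values:
        terms = value.lower().split() if tokenize else [value]
        counts.update(set(terms))  # a document counts once per distinct term
    return dict(counts)


# Hypothetical field values, one per document.
courtnames = ["District Court", "District Court", "Supreme Court"]

print(facet_counts(courtnames, tokenize=True))
print(facet_counts(courtnames, tokenize=False))
```

With tokenization the facet reports "district", "court" and "supreme" separately; without it, each full court name is one facet value. Note that in the schema from the question the text_not_tokenized type is defined, but the faceted fields such as casestatus and courtname still use type="text", so they are still tokenized.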

SOLR performance

I am using SolrJ + Solr in my project.
The problem is that I have hit an unexplained bottleneck in Solr/Jetty.
Using jvisualvm, I connected to the JVM instance running Solr and saw that 77% of the time is spent in the method "org.eclipse.jetty.io.ByteArrayBuffer.readFrom()". The stack trace of one of the threads is below:
"qtp64700533-36718" - Thread t@36718
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1040)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
So it may look fine that the time is spent on I/O, but:
the application issuing the queries runs on the same machine (so I/O time should not be large, and the "RUNNABLE" thread state in the stack trace above seems suspicious)
query response times can reach 5-10 seconds
the load average on the machine (CentOS) is about 10
Any help or advice is appreciated, thanks!
UPD:
Indeed, guys, I forgot to give additional info. Here it is:
hardware: i3770, 32gb ram; iotop shows 50-600kb/sec read and 200-1000kb/sec write (almost all of it from the Solr process)
OS: CentOS 6.6
java: OpenJDK 64-Bit Server VM (1.7.0_71 24.65-b04)
solr: 4.9.0 (launched with -Xmx=24000, but I think I should split the Solr cores into separate JVM instances to minimize GC time)
solrj: 4.10.3; adding/updating/removing documents is done with commitWithin=10000 msec in Java code.
about schemas: I am storing data (ads + objects) in Solr for 5 countries: UA, RU, PL, BY, KZ.
So there are 2 cores for each country, for example for Ukraine: ua_ads and ua_objects (10 cores in total).
The schemas are almost identical between countries; see below for Ukraine.
"ua_ads" schema (I should rename it from "example" though :) )
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<uniqueKey>adId</uniqueKey>
<field name="adId" type="long" indexed="true" stored="true" required="true"/>
<field name="objectId" type="long" indexed="true" stored="true" required="false"/>
<field name="url" type="string" indexed="false" stored="true" required="true"/>
<field name="regionId" type="int" indexed="false" stored="true" required="true"/>
<field name="sourceId" type="int" indexed="false" stored="true" required="true"/>
<field name="type" type="int" indexed="false" stored="true" required="true"/>
<field name="title" type="text_ru" indexed="false" stored="true" required="true"/>
<field name="address" type="text_ru" indexed="false" stored="true" required="true"/>
<field name="text" type="text_ru" indexed="false" stored="true" required="true"/>
<field name="dateFound" type="tdate" indexed="true" stored="true" required="true"/>
<!-- should be a string field (not int) to avoid cutting zero at beginning of phone number -->
<field name="phoneNumbers" type="string" indexed="true" stored="true" required="true" multiValued="true"/>
<field name="priceLocal" type="long" indexed="false" stored="true" required="false"/>
<field name="priceUsd" type="long" indexed="false" stored="true" required="false"/>
<field name="currency" type="int" indexed="false" stored="true" required="false"/>
<field name="roomsCount" type="int" indexed="false" stored="true" required="false"/>
<field name="area" type="int" indexed="false" stored="true" required="false"/>
<field name="imagesCount" type="int" indexed="true" stored="true" required="true"/>
</schema>
"ua_objects" schema
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldtype name="binary" class="solr.BinaryField"/>
<fieldType name="addr_ru" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<!-- no stemming for address, dots must be followed by space: "г. Киев" -->
<!-- char filters always run first (preprocessing) -->
<charFilter class="solr.MappingCharFilterFactory" mapping="lang/chars_replacement.txt" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- replacing all except letters, removing "-" in home address (9-А) -->
<filter class="solr.PatternReplaceFilterFactory" pattern="[^0-9abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюяіїє\-]" replacement="" replace="all"/>
<!-- replacing all except letters, removing "-" in home address ("9-а" => "9а") -->
<filter class="solr.PatternReplaceFilterFactory" pattern="(\d{1,3})[\- ]([абвгдеёжзийклмнопрстуфхцчшщ])" replacement="$1$2" replace="all"/>
<filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="lang/cities_ukr2rus.txt"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="ї" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="і" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="й" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="ё" replacement="е" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="є" replacement="е" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="э" replacement="е" replace="all"/>
<!-- 1-length is for case with home letters: "Хрещатик, 3" -->
<filter class="solr.LengthFilterFactory" min="1" max="64"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt,lang/stopwords_addr.txt" format="snowball"/>
</analyzer>
</fieldType>
<fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<!-- dots must be followed by space: "г. Киев" -->
<!-- char filters always run first (preprocessing) -->
<charFilter class="solr.MappingCharFilterFactory" mapping="lang/chars_replacement.txt" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="[^0-9abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюяіїє\-]" replacement="" replace="all"/>
<!-- replacing all except letters, removing "-" in home address ("9-а" => "9а") -->
<filter class="solr.PatternReplaceFilterFactory" pattern="(\d{1,3})[\- ]([абвгдеёжзийклмнопрстуфхцчшщ])" replacement="$1$2" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="ї" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="і" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="й" replacement="и" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="ё" replacement="е" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="є" replacement="е" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="э" replacement="е" replace="all"/>
<filter class="solr.LengthFilterFactory" min="1" max="64"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball"/>
<filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="lang/synonyms.txt"/>
<filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
</analyzer>
</fieldType>
<field name="_version_" type="long" indexed="true" stored="true"/>
<uniqueKey>objectId</uniqueKey>
<field name="objectId" type="long" indexed="true" stored="true" required="true"/>
<field name="url" type="string" indexed="false" stored="true" required="true"/>
<field name="regionId" type="int" indexed="true" stored="true" required="true"/>
<field name="sourceId" type="int" indexed="false" stored="true" required="true"/>
<field name="type" type="int" indexed="true" stored="true" required="true"/>
<field name="address" type="addr_ru" indexed="true" stored="true" required="true"/>
<field name="title" type="text_ru" indexed="true" stored="true" required="true"/>
<field name="text" type="text_ru" indexed="true" stored="true" required="true"/>
<field name="dateFound" type="tdate" indexed="true" stored="true" required="true"/>
<!-- should be a string field (not int) to avoid cutting zero at beginning of phone number -->
<field name="phoneNumbers" type="string" indexed="true" stored="true" required="true" multiValued="true"/>
<field name="ownerDetected" type="boolean" indexed="true" stored="true" required="true"/>
<field name="priceUsd" type="long" indexed="true" stored="true" required="false"/>
<field name="priceLocal" type="long" indexed="false" stored="true" required="false"/>
<field name="currency" type="int" indexed="false" stored="true" required="false"/>
<field name="roomsCount" type="int" indexed="true" stored="true" required="false"/>
<field name="area" type="int" indexed="true" stored="true" required="false"/>
<field name="dateUpdated" type="tdate" indexed="true" stored="true" required="true"/>
<field name="dateClosed" type="tdate" indexed="true" stored="true" required="false"/>
<field name="m2priceRel" type="float" indexed="true" stored="true" required="false"/>
<field name="ceddData" type="binary" indexed="false" stored="true" required="false" multiValued="true"/>
<field name="imagesCount" type="int" indexed="true" stored="true" required="true"/>
<field name="uniqAdTexts" type="string" indexed="false" stored="true" required="true" multiValued="true"/>
</schema>
biggest indexes:
ru_ads: 2.99gb
ru_objects: 3.25gb
ua_ads: 5.45gb
ua_objects: 2.36gb
the other cores' indexes are relatively small
The queries that run too long ("too long" from the client side) look like these (taken from the Solr log; "????" is just non-English letters):
400723188 [qtp64700533-40547] INFO org.apache.solr.core.SolrCore ? [ua-objects] webapp=/solr path=/select params={mm=2&fl=*&start=0&q=(??????\+????????\+???????\+????????)+AND+type:3+AND+regionId:2+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[2+TO+2])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[40+TO+60])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[23500+TO+70500])+AND+dateUpdated:[2014-12-09T10:23:07Z+TO+2015-01-28T10:23:07Z]+AND+-objectId:(27824841)&qf=address^20+title^2&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=18 status=0 QTime=287
401989528 [qtp64700533-40830] INFO org.apache.solr.core.SolrCore ? [ru-objects] webapp=/solr path=/select params={mm=2&fl=*&start=0&q=(?????????????\+??????)+AND+type:4+AND+regionId:162+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[1+TO+1])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[40+TO+58])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[9+TO+27])+AND+dateUpdated:[2014-12-09T10:44:08Z+TO+2015-01-28T10:44:08Z]+AND+-objectId:(26415616)&qf=address^20+title^2&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=820 status=0 QTime=5755
400832723 [qtp64700533-40322] INFO org.apache.solr.core.SolrCore ? [ru-objects] webapp=/solr path=/select params={mm=2&fl=*&start=0&q=(????????\+???????)+AND+type:4+AND+regionId:102+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[1+TO+1])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[31+TO+45])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[115+TO+343])+AND+dateUpdated:[2014-12-09T10:24:57Z+TO+2015-01-28T10:24:57Z]+AND+-objectId:(26415342)&qf=address^20+title^2&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=9 status=0 QTime=372
402069370 [qtp64700533-40832] INFO org.apache.solr.core.SolrCore ? [ru-objects] webapp=/solr path=/select params={mm=1&fl=*&start=0&q=(????????\+?????????\+??\+????????)+AND+type:3+AND+regionId:135+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[1+TO+1])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[28+TO+40])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[9529+TO+28585])+AND+dateUpdated:[2014-10-30T10:45:33Z+TO+2015-01-28T10:45:33Z]+AND+-objectId:(26415855)&qf=address^20+title^2+text&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=14075 status=0 QTime=544
401805198 [qtp64700533-40233] INFO org.apache.solr.core.SolrCore ? [ua-objects] webapp=/solr path=/select params={mm=2&fl=*&start=0&q=(??????\+??\+??????\+?????\+??????????)+AND+type:3+AND+regionId:16+AND+((*:*+AND+-roomsCount:[*+TO+*])+OR+roomsCount:[3+TO+3])+AND+((*:*+AND+-area:[*+TO+*])+OR+area:[93+TO+95])+AND+((*:*+AND+-priceUsd:[*+TO+*])+OR+priceUsd:[284050+TO+313950])+AND+dateUpdated:[2015-01-08T10:41:09Z+TO+2015-01-28T10:41:09Z]+AND+-objectId:(27826334)&qf=address^20+title^2&wt=javabin&version=2&defType=edismax&rows=2147483647} hits=6 status=0 QTime=462
here is fresh profiling screenshot from jvisualvm
part of "top" command, delay=10sec
You have set the parameter rows=2147483647 in every one of your queries. The meaning of this parameter is (taken from the reference):
You can use the rows parameter to paginate results from a query. The
parameter specifies the maximum number of documents from the complete
result set that Solr should return to the client at one time.
The default value is 10. That is, by default, Solr returns 10
documents at a time in response to a query.
So you are in effect telling Solr to send all hits found for a query in a single response. This is the reason for your bad performance.
Does Google send you all 500,000,000 hits when you query for "java"? No. Why not? Performance. Every IR application I know returns a small page with the first results so that a search performs well.
This is the reason for your high I/O: Solr fetches the records from disk and writes them to the response. This is I/O, nothing more, nothing less.
Since you are using this for analytics and want to extract everything that matches, you should look into the new streaming export feature. Unfortunately, it is only available as of Solr 4.10.
You can also upgrade to an SSD; it gives Solr performance a very good boost.
Finally, review your cache settings. If you don't update frequently and some of the caches are full, you could increase the defaults. If you do update frequently, this is less beneficial, because caches are invalidated on commits.
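The paging that the first answer recommends can be sketched with Solr's start and rows parameters. This is a minimal Python sketch that only builds the request parameters and does not talk to Solr. The function name `page_params` is hypothetical, and `max_docs` is assumed to be a cap chosen by the application, or the numFound value from a first response.

```python
def page_params(base_params, page_size, max_docs):
    """Yield one parameter dict per page, using Solr's start/rows parameters."""
    for start in range(0, max_docs, page_size):
        params = dict(base_params)  # copy so the caller's dict is not mutated
        params["start"] = start
        # The last page may be shorter than page_size.
        params["rows"] = min(page_size, max_docs - start)
        yield params


# Hypothetical base query, loosely modelled on the logged queries above.
base = {"q": "type:3 AND regionId:2", "defType": "edismax", "wt": "json"}

for params in page_params(base, page_size=50, max_docs=120):
    print(params["start"], params["rows"])
```

Each request then asks for at most page_size stored documents instead of up to 2147483647, which bounds how much Solr has to read from disk and write to the response per request.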

Spring Data Solr geoqueries

I've just started to play around with Solr and managed to get it running in a Tomcat servlet container. I would now like to use the repository approach from Spring Data, but I got stuck when trying to handle lat/lon fields (i.e. geospatial data). I would like to store some tweet-like data. This is the schema I am currently using (trying to follow the wiki):
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="tweets" version="1.1">
<types>
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text1" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HunspellStemFilterFactory"
dictionary="../../dictionaries/es_ANY.dic"
affix="../../dictionaries/es_ANY.aff"
ignoreCase="true" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text2" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<fieldType name="date" class="solr.DateField"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="username" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="pictureURL" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="topic" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="content" type="text1" indexed="true" stored="true"/>
<field name="hashtags" type="text2" indexed="true" stored="true"/>
<field name="geo" type="location" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>id</defaultSearchField>
</schema>
This would work fine without the geo field, which I don't know how to map in my POJO (I tried both using double[] like MongoDB and String in geo field without much success):
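import java.util.Date;
import java.util.List;
import org.apache.solr.client.solrj.beans.Field;
import org.springframework.data.annotation.Id;

public class Tweet {
    @Id
    @Field
    private String id;
    @Field
    private String username;
    @Field
    private String pictureURL;
    @Field
    private String topic;
    @Field
    private String content;
    @Field
    private List<String> hashtags;
    // Mapping the location as a plain "lat,lng" String is what fails below
    @Field
    private String geo;
    @Field
    private Date timestamp;
    /** Getters/setters omitted **/
}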
When mapping the geo field as a simple String ([lat],[lng]) the exception thrown is:
org.springframework.data.solr.UncategorizedSolrException: undefined field: "geo_0_coordinate"; nested exception is org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: undefined field: "geo_0_coordinate"
I tried having a look at the project tests but did not find any POJO using geo fields.
Any idea on how to proceed?
Thanks!
I finally found a solution. First of all, the geo field should be a GeoLocation:
@Field
private GeoLocation geo;
Another change required takes place in the schema.xml file:
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<fieldType name="double" class="solr.DoubleField"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<!-- ... -->
<field name="geo" type="location" indexed="true" stored="true"/>
<field name="geo_0_coordinate" type="double" indexed="true" stored="true" />
<field name="geo_1_coordinate" type="double" indexed="true" stored="true" />
It turns out Solr stores a LatLonType internally as a pair of doubles, and those sub-fields must also be defined in the schema.
Hope this helps someone else!
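The sub-field names in the exception follow directly from the subFieldSuffix setting, and the stored value of a LatLonType field is a "lat,lon" string. Below is a minimal Python sketch of both, plus the standard geofilt filter syntax for querying such a field. The helper names and the sample coordinates are hypothetical; the sketch only builds strings and does not contact Solr.

```python
def subfield_names(field, suffix="_coordinate"):
    """Names of the two sub-fields a LatLonType field indexes into."""
    return [f"{field}_0{suffix}", f"{field}_1{suffix}"]


def latlon_value(lat, lon):
    """Value format a LatLonType field expects: "lat,lon"."""
    return f"{lat},{lon}"


def geofilt(field, lat, lon, radius_km):
    """Filter query matching documents within radius_km of the given point."""
    return f"{{!geofilt sfield={field} pt={lat},{lon} d={radius_km}}}"


print(subfield_names("geo"))
print(latlon_value(40.4168, -3.7038))
print(geofilt("geo", 40.4168, -3.7038, 5))
```

The first call yields geo_0_coordinate and geo_1_coordinate, the same names as in the "undefined field" exception and in the explicit field definitions of the fix above.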

Unable to see data in a string field in apache solr 3.6 when importing data from mysql

All the other fields contain the imported data, but I don't see the company_logo field when I search all the results. Here is my data config file:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost:3306/demandfire"
user="root" password=""/>
<document>
<entity name="core4"
query="select company.company_name,demand.id,demand.issue,
demand.suggestion,demand.title,demand.company_id,company.logo from company,demand
where demand.company_id = company.id;
">
<field column="demand.id" name="id"/>
<field column="demand.issue" name="issue"/>
<field column="demand.suggestion" name="suggestion"/>
<field column="demand.title" name="title"/>
<field column="demand.company_id" name="company_id"/>
<field column="company.company_name" name="company_name"/>
<field column="company.logo" name="company_logo"/>
</entity>
</document>
</dataConfig>
The following is my schema file. The problem is with the company_logo field: I have mapped it correctly in the data-config file, and all the other fields receive data, but this one does not. A sample entry in the logo column of the MySQL table looks like '6bf38f4e-a9af-40b8-af04-2b90d3c93f1f.jpg'.
<schema name="example core one" version="1.1">
<types>
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="string_lowercase" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<!-- general -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="issue" type="string_lowercase" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="suggestion" type="string_lowercase" indexed = "true" stored="true" multivalued="false" required = "true"/>
<field name="title" type="string_lowercase" indexed = "true" stored="true" multivalued="false" required = "true"/>
<field name="company_id" type="string" indexed = "true" stored="true" multivalued="false" required = "true"/>
<field name="company_name" type="string_lowercase" indexed = "true" stored="true" multivalued="false" required = "true"/>
<field name="company_logo" type="string" indexed = "true" stored="true" multivalued="false" required = "false"/>
</fields>
<!-- field to use to determine and enforce document uniqueness. -->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>title</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
</schema>
Try removing the required="false" attribute from your field (false should be the default anyway).
That is, try changing the definition of your field to this:
<field name="company_logo" type="string" indexed = "true" stored="true" multivalued="false"/>

Resources