I created three SolrCloud instances so that data would be sharded across them and could be queried from any of them. I created them using the commands below.
CMD:
solr.cmd start -c -s Node1 -p 8983
solr.cmd start -c -s Node2 -z localhost:9983 -p 8984
solr.cmd start -c -s Node3 -z localhost:9983 -p 8985
Then I created a collection which uses three shards and has a replication factor of 1.
CMD1:
solr.cmd create_collection -c tests -shards 3 -replicationFactor 1
Then I indexed data into the collection with post.jar, using the following command.
CMD2:
java -Dc=tests -jar post.jar *.xml
There were 32 XML files in that location.
As per my understanding, the data should be split and indexed across all three SolrCloud instances.
But what happened was that all 32 documents were indexed on all three instances.
I confirmed this using the following URLs:
http://localhost:8984/solr/tests/select?indent=on&q=*:*&wt=json
http://localhost:8985/solr/tests/select?indent=on&q=*:*&wt=json
http://localhost:8983/solr/tests/select?indent=on&q=*:*&wt=json
Everything returned the same number of records.
Again, my understanding is that the documents should be split across the three instances, not duplicated on each.
Since I want to index 3 billion documents into Solr, and Solr has a hard limit of about 2 billion documents per index, I want to make sure the documents are split and indexed across the three Solr instances.
Let me know if I have made any mistakes.
Versions:
Solr = 6.1.0
Windows = 7
When you're querying /solr/tests, you're querying the tests collection. Behind the scenes Solr is fetching all the documents in that collection and returning them for you, from all the shards added to the collection.
You've stumbled upon the idea behind a collection in Solr: regardless of which server you're querying, Solr returns the result for the whole collection, including all documents added to that collection. The only difference between the three requests you're making is which server is responsible for returning the result to the client and making the requests to fetch results from the other cores.
If you want to explore the contents of a single core, the cores are named collectionname_shardX_replicaY. You can examine the current cluster state by downloading the cluster state JSON from the ZooKeeper instance; this will show you exactly which shards are located where.
You can also use the CoreAdmin API on a single node to examine which cores have been placed on that server. Be aware that you do NOT want to do any mutable actions through the CoreAdmin API when you're running in cloud mode.
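As a quick sanity check (the core name below is an assumption; use whatever names the admin UI or the CoreAdmin STATUS output shows on your node), you can list the cores on a node and then query one core directly:
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"
curl "http://localhost:8983/solr/tests_shard1_replica1/select?q=*:*&rows=0&distrib=false"
The second request passes distrib=false so it is not fanned out to the other shards; numFound then reflects only that shard's share of the 32 documents instead of the whole collection.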
I want to load multiple cores that were created in DataStax Cassandra Solr.
The objective is to create various Banana dashboards and provide them to users on a per-core basis.
Currently I am able to do it by changing:
$DSE_HOME/resources/banana/src/config.js
solr_core: "MY_OWN_CORE"
Is it possible to load multiple cores by giving a list in the above property?
Or what would be the best way to give each Cassandra table/Solr core an individual dashboard?
Currently I have followed this link to enable Banana in DSE and to load one Solr core.
The version of DSE I am using is DSE 5.0.11.
The best way might be to have multiple instances of your banana directory, one per search core, under $DSE_HOME/resources.
My problem was solved with the steps below. I still need to set the following in $DSE_HOME/resources/banana/src/config.js:
solr_core: "MY_OWN_CORE"
but I can then change to or load another core from the Banana UI.
1) Clone https://github.com/LucidWorks/banana to $DSE_HOME/resources/banana.
Make sure you've checked out the release branch (should be the default).
If you want, you can rm -rf .git at this point to save space, but it's not very big anyway.
2) Edit resources/banana/src/config.js and:
change solr_core to the core you're most frequently going to work with (this is only a convenience; you can pick a different one later in the settings for each dashboard).
change banana_index to banana.dashboards (it can be anything you want, but modify step 3 accordingly). Not strictly necessary if you don't want to save dashboards to Solr.
3) Post the banana schema from resources/banana/resources/banana-int-solr-4.5/banana-int/conf
Use the solrconfig.xml from the wikipedia demo instead of the one provided by banana.
I recommend calling the core banana.dashboards.
Not strictly necessary if you don't want to save dashboards to solr.
curl --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8' "http://localhost:8983/solr/resource/banana.dashboards/solrconfig.xml"
curl --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8' "http://localhost:8983/solr/resource/banana.dashboards/schema.xml"
curl -X POST -H 'Content-type:text/xml; charset=utf-8' "http://localhost:8983/solr/admin/cores?action=CREATE&name=banana.dashboards"
4) Edit resources/tomcat/conf/server.xml and add a <Context> element for the Banana webapp inside the <Host> tags.
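A sketch of what that element can look like (the docBase path is an assumption; point it at the absolute path of your banana/src directory):
<Context docBase="/path/to/dse/resources/banana/src" path="/banana" />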
5) If you've previously started DSE, remove resources/tomcat/work.
6) Start DSE in Solr mode, and go to http://localhost:8983/banana
Usually we use the zkcli.sh provided by Solr itself to manage config sets for a SolrCloud collection. Sometimes a cluster of external ZooKeepers (typically 3 instances in my case) is used instead of a single one.
The issue is that when uploading a config set to ZooKeeper, it seems I can only upload to a single ZooKeeper instance at a time,
e.g. zkcli.sh -cmd upconfig -z localhost:2181 -n dummy -d dummy/
To upload to the other 2 (or N-1) instances, I need to repeat the command as many times with a different host and port.
The question is:
Does zkcli.sh from Solr itself provide some way to upload to the whole cluster in a single command?
The reason I ask:
After all, when setting up the ZooKeeper cluster, each instance is aware of the other instances in the cluster, so I think it should be possible to provide an automatic sync mechanism.
Also, an update within a ZooKeeper cluster should be an atomic operation, otherwise it might cause issues, right?
To upload the config to more than one ZooKeeper instance, pass the whole ensemble as a comma-separated connection string:
./zkcli.sh -cmd upconfig -confdir /opt/solr/collection/conf -confname config_name -z <zookeeper1 ipaddress>:2181,<zookeeper2 ipaddress>:2181,<zookeeper3 ipaddress>:2181
Note that ZooKeeper replicates every write across the ensemble on its own, so even an upload sent to a single instance becomes visible on all of them; the multi-host connection string mainly gives the client failover if one instance is down.
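Newer Solr releases (5.5 and later) also wrap the same operation in the bin/solr script (a sketch; adjust the config name, directory, and hosts to your setup):
bin/solr zk upconfig -n config_name -d /opt/solr/collection/conf -z <zookeeper1 ipaddress>:2181,<zookeeper2 ipaddress>:2181,<zookeeper3 ipaddress>:2181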
I initially set up SOLR CLOUD with two Solr nodes, as shown below.
Now I have to add a new Solr node, i.e. an additional shard, with the same number of replicas as the existing SOLR CLUSTER nodes.
I have already gone through the Solr scaling and distribution documentation: https://cwiki.apache.org/confluence/display/solr/Introduction+to+Scaling+and+Distribution
But the above link only covers scaling in Solr standalone mode. That's the sad part.
I have started the SOLR CLUSTER nodes using the following command
./bin/solr start -c -s server/solr -p 8983 -z [zkip's] -noprompt
Kindly share the command for creating the new shard when adding the new node.
Thanks in advance.
From my knowledge, I am sharing this answer.
Adding a new SOLR CLOUD / SOLR CLUSTER node means having a copy of all the SHARDs on the new box (through replication of all the SHARDs).
SHARD: the actual data is split equally across the number of SHARDs we create (while creating the collection).
So while adding the new SOLR CLOUD node, make sure that all the SHARDs are available on the new node (recommended), or as required.
Naming Standards of SOLR CORE in SOLR CLOUD MODE/ CLUSTER MODE
Syntax:
<COLLECTION_NAME>_shard<SHARD_NUMBER>_replica<REPLICA_NUMBER>
Example
CORE NAME : enter_2_shard1_replica1
COLLECTION_NAME : enter_2
SHARD_NUMBER : 1
REPLICA_NUMBER : 1
STEPS FOR ADDING THE NEW SOLR CLOUD/CLUSTER NODE
Create a core with the same collection name as used on the existing SOLR CLOUD nodes.
Notes while creating a new core on the new node:
Example :
enter_2_shard1_replica1
enter_2_shard1_replica2
From the above example, the highest replica number for the corresponding shard is 2 (enter_2_shard1_replica2).
So on the new node, while creating a core, give the replica number as 3, i.e. "enter_2_shard1_replica3", so that SOLR will take it as the third replica of the corresponding SHARD.
Note: replica numbering should be in incremental order, in steps of 1.
Give it time to replicate the data from the existing nodes to the new node.
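As a sketch of the actual call (the new node's host name and port are assumptions), in cloud mode it is safest to create the replica through the Collections API ADDREPLICA action rather than through the CoreAdmin API:
http://newnode:8983/solr/admin/collections?action=ADDREPLICA&collection=enter_2&shard=shard1&node=newnode:8983_solr
ADDREPLICA names the new core following the convention above and registers it with ZooKeeper, after which Solr replicates the shard's data to it.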
I have been trying to implement a SolrCloud, and everything works fine until I try to create a collection with 6 shards. My setup is as follows:
5 virtual servers, all running Ubuntu 14.04, hosted by a single company across different data centers
3 servers running ZooKeeper 3.4.6 for the ensemble
2 servers, each running Solr 5.1.0 server (Jetty)
The Solr instances each have a 2TB, ext4 secondary disk for the indexes, mounted at /solrData/Indexes. I set this value in solrconfig.xml via <dataDir>/solrData/Indexes</dataDir>, and uploaded it to the ZooKeeper ensemble. Note that these secondary disks are neither NAS nor NFS, which I know can cause problems. The solr user owns /solrData.
All the intra-server communication is via private IP, since all are hosted by the same company. I'm using iptables for firewall, and the ports are open and all the servers are communicating successfully. Config upload to ZooKeeper is successful, and I can see via the Solr admin interface that both nodes are available.
The trouble starts when I try to create a collection using the following command:
http://xxx.xxx.xxx.xxx:8983/solr/admin/collections?action=CREATE&name=coll1&maxShardsPerNode=6&router.name=implicit&shards=shard1,shard2,shard3,shard4,shard5,shard6&router.field=shard&async=4444
Via the Solr UI logging, I see that multiple index creation commands are issued simultaneously, like so:
6/25/2015, 7:55:45 AM WARN SolrCore [coll1_shard2_replica1] Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
6/25/2015, 7:55:45 AM WARN SolrCore [coll1_shard1_replica2] Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
Ultimately the task gets reported as complete, but in the log, I have locking errors:
Error creating core [coll1_shard2_replica1]: Lock obtain timed out: SimpleFSLock@/solrData/Indexes/index/write.lock
SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
Error closing IndexWriter
If I look at the cloud graph, maybe a couple of the shards will have been created, others are closed or recovering, and if I restart Solr, none of the cores can fire up.
Now, I know what you're going to say: follow this SO post and change solrconfig.xml locking settings to this:
<unlockOnStartup>true</unlockOnStartup>
<lockType>simple</lockType>
I did that, and it had no impact whatsoever. Hence the question. I'm about to have to release a single Solr instance into production, which I hate to do. Does anybody know how to fix this?
Based on the log entry you supplied, it looks like Solr may be creating the data (index) directory for EACH shard in the same folder.
Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
This message was shown for two different cores, yet it references the same location. What I usually do is change my Solr home to a different directory, under which all the collection "instance" directories are created. Then I manually edit core.properties for each shard to specify the location of its index data.
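As a sketch, there are two standard ways to give each core its own index directory on the shared disk (the paths here are assumptions). Either reference the implicit core-name property in the solrconfig.xml you upload to ZooKeeper:
<dataDir>/solrData/Indexes/${solr.core.name}</dataDir>
or set dataDir explicitly in each core's core.properties:
dataDir=/solrData/Indexes/coll1_shard2_replica1
Either way, coll1_shard2_replica1 and coll1_shard1_replica2 stop contending for the same write.lock.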
Currently I have a zookeeper instance controlling replication on 3 physical servers. It is the solr integrated zookeeper. 1 shard, 1 collection.
I have a new requirement in which I will need a new static solr instance (1 new collection, no replication). Same schema as previous collection. A copy of this instance will also be placed on the 3 physical servers mentioned above. A caveat is that I need to perform distributed searches across the 2 collections and have the results blended.
Thanks to javacreed I now know that sharding is not part of my solution. My previous questions were answered here and here.
In my current setup I run the following command on the server running zookeeper -
java -Dbootstrap_confdir=solr/myApp/conf -Dcollection.configName=myConfig -DzkRun -DnumShards=1 -jar start.jar
Am I correct in saying that this will not change, and that I will now also manually start the non-replicated collection? Do I really only need to change my search queries to include the 'collection' parameter? Something like -
http://localhost:8983/solr/collection1/select?collection=collection1,collection2
This example is from Solr documentation. I am slightly confused as to whether it should be ...solr/collection1/select?... or ...solr/collection2/select?... or if it even matters?
Thanks
Thanks for your kind words, stewart. You can search it directly on Solr as:
http://localhost:8983/solr/select?collection=collection1,collection2
There is no need to mention any collection in the path, since you are defining the collections in the collection parameter.
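For completeness, a sketch of the same query against a concrete collection endpoint (the field name is an assumption; blended results also rely on the two collections sharing a compatible schema, in particular the same uniqueKey field):
http://localhost:8983/solr/collection1/select?q=title:foo&collection=collection1,collection2
Whichever collection appears in the path merely coordinates the request; the collection parameter determines what is actually searched, so using collection1 or collection2 in the path does not change the results.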