Remove Shards from SOLR Database

I recently recovered a SOLR database that uses SolrCloud to shard an index. I now have that database running on a single machine, but the data is still sharded, which is now unnecessary.
How can I stop using SOLR cloud and merge these shards into a single collection?

I ended up using the Lucene Merge Index tool. The SOLR approaches did not work for me (they failed with obtuse errors).
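For reference, the merge tool lives in the lucene-misc jar as org.apache.lucene.misc.IndexMergeTool: it takes a target directory for the merged index followed by the source shard indexes. A sketch of the invocation, with placeholder paths and jar versions (match the Lucene jars to your Solr version):

java -cp lucene-core-4.6.0.jar:lucene-misc-4.6.0.jar org.apache.lucene.misc.IndexMergeTool /path/to/merged/index /var/solr/shard1/data/index /var/solr/shard2/data/index

Afterwards a single standalone Solr core can be pointed at the merged index directory.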

Related

HBase indexer for Solr

I'm trying to index data from an HBase table using the Lucidworks HBase indexer, and I would like to know: do Solr, the HBase indexer, and HBase have to use the same ZooKeeper?
Can my Solr instance be independent, with HBase and the HBase indexer reporting to zookeeper1 while Solr reports to its own ZooKeeper?
I'm following the URL below:
https://community.hortonworks.com/articles/1181/hbase-indexing-to-solr-with-hdp-search-in-hdp-23.html
It is up to you whether to use the same ZooKeeper ensemble or separate, independent ones.
For a production HBase setup, ZooKeeper is recommended to run as a 3-node ensemble, so three ZooKeeper servers are required anyway; you can reuse the same ensemble for Solr as well, which reduces the number of servers.
ZooKeeper is a lightweight service that coordinates and monitors the Solr servers, so for a production run it is good practice to keep ZooKeeper off the Solr servers themselves.
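Either way, the split comes down to the connect string each component is given. As a sketch, assuming a separate three-node ensemble reserved for Solr (host names are hypothetical), a SolrCloud 4.x node started from the example directory can be pointed at it like this:

java -DzkHost=solr-zk1:2181,solr-zk2:2181,solr-zk3:2181 -jar start.jar

The HBase indexer and HBase would keep using their own ensemble via their own configuration, so the two quorums never need to overlap.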

Problems with Solr in CKAN

I have a problem with Solr and CKAN.
I understand that Solr is not directly linked to PostgreSQL; the Solr index is maintained by the CKAN code itself.
I've lost all of Solr's data because the index is broken, so now I can't run queries against Solr. How can I recover all the data in Solr?
Is there any crawling method that can help me? Or is it enough to dump my CKAN database and export/import it again?
You can use the search-index command of CKAN's CLI to rebuild the Solr index:
Rebuilds the search index. This is useful to prevent search indexes from getting out of sync with the main database.
For example:
paster --plugin=ckan search-index rebuild --config=/etc/ckan/std/std.ini
This default behaviour will clear the index and rebuild it with all datasets. If you want to rebuild it for only one dataset, you can provide a dataset name:
paster --plugin=ckan search-index rebuild test-dataset-name --config=/etc/ckan/std/std.ini
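Note that paster is the CLI for CKAN 2.8 and earlier; on CKAN 2.9+ the same task is exposed through the ckan command. The equivalent rebuild, reusing the config path above:

ckan -c /etc/ckan/std/std.ini search-index rebuild

In both cases the index is rebuilt from the datasets stored in the database, so as long as PostgreSQL is intact no data is truly lost.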

How to setup Solr Replication with two search servers?

Hi, I'm developing a Rails project with Sunspot Solr and want to configure Solr replication.
My environment: rails 3.2.1, ruby 2.1.2, sunspot 2.1.0, Solr 4.1.6.
Why replication: I need a more stable system. Oftentimes the search server goes down for maintenance and the web application stops working in production. So I'm thinking about running two identical search servers instead of one to make the system more stable: if one server goes down, the other keeps working.
I cannot find any tutorial that is simple, easy to understand, and described in detail...
I'm trying to set up replication on two servers, but I do not fully understand how replication works internally:
how data is synchronized between the two servers (is it automatic?)
how search requests are balanced between the two servers
when one server suddenly stops working, the other should become the master (is that automatic?)
are there replication features other than those listed?
The answer to this is similar to:
How to setup Solr Cloud with two search servers?
What is the difference between Solr Replication and Solr Cloud?
Can we close this as a duplicate?
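For reference, the legacy master/slave replication the question is about is configured through the ReplicationHandler in solrconfig.xml, and that answers part of the list above: the slave polls the master automatically, but load balancing and failover are left to the client. A minimal sketch, with placeholder host and core names:

<!-- on the master, in solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- on the slave, in solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core1</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

Promoting a slave to master is not automatic in this scheme; that kind of failover is what SolrCloud adds.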

Solr Distributed Search vs. Solr Cloud

For Solr 4.3 users, what would be the benefit of using Solr Distributed Search over Solr Cloud?
Or should all Solr deployments after 4.x just use SolrCloud, and forget about Solr Distributed Search?
Benefit:
There won't be any benefit of Distributed Search over SolrCloud. SolrCloud is currently the most efficient way to deploy a Solr cluster. It takes care of all your instances using ZooKeeper and is very effective for high availability.
Efficient management
ZooKeeper tracks the cluster state, and documents are routed to shards automatically, so you don't have to decide which documents go to which instance.
I have used SolrCloud in production as well, and it works wonderfully in high-traffic scenarios.
SolrCloud itself performs distributed search across its shards.
No, you can still run any deployment after 4.x as a normal standalone Solr instance; just omit the zkHost parameter at startup.
JOINs are not supported in SolrCloud, which is a big drawback.
If you want to control the shards yourself, i.e. which shard contains which record, go for distributed search; otherwise go for SolrCloud, which manages all shards itself.
With distributed search we can run multiple Solr instances, so if one fails we can move to another. In SolrCloud, ZooKeeper manages all of this, so if ZooKeeper fails the system goes down.
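For illustration, legacy distributed search is driven entirely by the client: each query lists the shards to fan out to in the shards parameter. A sketch with hypothetical host and core names:

curl 'http://host1:8983/solr/core1/select?q=*:*&shards=host1:8983/solr/core1,host2:8983/solr/core1'

Routing documents to the right shard at index time is then the application's responsibility, which is exactly the control (and the burden) described above.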

Sync solr documents with database records

I wonder whether there is a proper way to keep Solr documents in sync with database records. The problem I usually have: there are Solr documents whose referenced database records no longer exist. It seems some DB records have been deleted, but no trigger fired to update Solr. I want to write a rake task, run periodically, to remove such documents from Solr.
Any suggestions?
Chamnap
Yes, there is one.
You have to use the DataImportHandler with its delta-import feature.
Basically, you specify a query that picks up only the rows that have been modified, instead of rebuilding the whole index. Here's an example.
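A minimal sketch of such an entity in the DIH data-config.xml, with hypothetical table and column names; deletedPkQuery is the part that removes rows deleted from the database:

<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE id = '${dataimporter.delta.id}'"
        deletedPkQuery="SELECT id FROM item WHERE deleted = 1"/>

Running /dataimport?command=delta-import then applies only the changes since the last run.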
Otherwise, you can add a feature to your application that simply triggers the removal of a document via HTTP, deleting it from both your DB and your index.
I'm using Java + Java DB + Lucene (which Solr is based on) for my text search and database records. My solution is to back up and then recreate (delete + create) the Lucene index to sync it with my records in Java DB. This seems to be the easiest approach; the only problem is that it is not advisable to run often. It also means that your records are not updated in real time. I run my batch job nightly so that all changes are reflected the next day. Hope this helps.
Also read an article about syncing Solr and DB records here under "No synchronization". It states that it's not easy, but possible in some cases. It would be helpful if you specified your programming language so more people can help you.
In addition to the above, "soft" deletion by setting a deleted or deleted_at column is a great approach. That way you can run a script to periodically clear out deleted records from your Solr index as needed.
You mention using a rake task; is this a Rails app you're working with? Most Solr clients for Rails apps support deleting records via an after_destroy hook.
