I have a cluster running HBase with Phoenix. I've installed Solr on it.
Now I'm trying to set up HBase replication for the cluster, following this manual:
https://community.hortonworks.com/articles/1181/hbase-indexing-to-solr-with-hdp-search-in-hdp-23.html
I started the hbase-indexer server, added an indexer, put data via the HBase shell, and requested a commit via the browser.
But there are no changes in the Solr collection: zero new records.
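Roughly, the sequence was the following (the indexer, table, and collection names here are generic placeholders, not the exact ones from my setup):

# start the indexer server
hbase-indexer server &
# register an indexer for the table
hbase-indexer add-indexer -n myindexer -c indexer-conf.xml -cp solr.zk=myserver:2181/solr -cp solr.collection=collection1
# write a test row via the HBase shell
echo "put 'indexed_table', 'row1', 'cf:field', 'value1'" | hbase shell
# ask Solr to commit
curl "http://myserver:8983/solr/collection1/update?commit=true"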
The status 'replication' command in the HBase shell shows sizeOfLogQueue increasing with every put to the indexed table.
When grepping the HBase log (hbase-hbase-regionserver-myserver.log), I found lots of records like this:
Indexer_hbaseindexer: Total replicated edits: 0, currently replicating
from:
hdfs://HDP-Test/apps/hbase/data/WALs/myserver,16020,1519204674681/myserver%2C16020%2C1519204674681.default.1519204995372
at position: 45671433
The position here never changes.
The author of the issue at this link says that after changing the WAL codec to IndexedWALEditCodec, HBase replication stops.
Can IndexedWALEditCodec really stop HBase replication from working correctly? That shouldn't be true.
What else might the problem be? Any hint would be appreciated.
Environment:
HDFS 2.7.3
HBase 1.1.2
Solr 5.5.2
HBase Indexer 2.2.8
P.S. When I restart HBase and then request a Solr commit, the changes do appear. But after that it doesn't do anything again.
I'm trying to index data from an HBase table using the Lucidworks HBase indexer, and I would like to know: do Solr, the HBase indexer, and HBase have to use the same ZooKeeper?
Can my Solr instance be independent, with HBase and the HBase indexer reporting to zookeeper1 while Solr reports to its own ZooKeeper?
I'm following the URL below:
https://community.hortonworks.com/articles/1181/hbase-indexing-to-solr-with-hdp-search-in-hdp-23.html
It's up to you whether to go with the same ZooKeeper ensemble or a separate, independent one.
For a production HBase setup, a 3-node ZooKeeper ensemble is recommended, which means three ZooKeeper servers are required anyway, so Solr can make use of the same ensemble.
That helps reduce the number of servers.
ZooKeeper is a lightweight service that coordinates the Solr servers, so for a production run it's good to keep ZooKeeper on machines separate from the Solr servers themselves.
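As a sketch, sharing one 3-node ensemble while keeping Solr logically separated might look like this (the hostnames and the /solr chroot are placeholders):

# SolrCloud joins the shared ensemble, isolated under its own chroot
bin/solr start -cloud -z zk1:2181,zk2:2181,zk3:2181/solr

The HBase indexer then points at the same ensemble (for the NGDATA hbase-indexer that is the ZooKeeper connect string in hbase-indexer-site.xml), and the chroot keeps Solr's znodes separate from HBase's even though all three systems use the same three servers.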
I have a problem with Solr and CKAN.
I understand that Solr is not directly linked to PostgreSQL; the Solr index is maintained by the CKAN code itself.
I've lost all of Solr's data because the index broke, so now I can't run queries in Solr. How can I recover all the data in Solr?
Is there any crawling method that can help me? Or is it enough to dump my CKAN database and export/import it again?
You can use the search-index command from CKAN's CLI to rebuild the Solr index:
Rebuilds the search index. This is useful to prevent search indexes from getting out of sync with the main database.
For example:
paster --plugin=ckan search-index rebuild --config=/etc/ckan/std/std.ini
By default this clears the index and rebuilds it with all datasets. If you want to rebuild it for only one dataset, you can provide the dataset name:
paster --plugin=ckan search-index rebuild test-dataset-name --config=/etc/ckan/std/std.ini
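Since the command exists precisely to fix an out-of-sync index, it can also be scheduled. A sketch of a nightly cron entry, reusing the config path from the examples above (check whether your CKAN version supports the -r/--refresh flag, which re-indexes datasets without clearing the index first):

# rebuild the CKAN search index every night at 03:00
0 3 * * * paster --plugin=ckan search-index rebuild -r --config=/etc/ckan/std/std.ini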
I've just read through the index update strategies document below, but couldn't get a clear answer on which strategy is best for a Solr search implementation:
https://doc.sitecore.net/sitecore_experience_platform/search_and_indexing/index_update_strategies
We have set up master and slave Solr endpoints, where the master is used for creates/updates and the slave for reads only.
I'd appreciate it if you could suggest the indexing strategy to be used for:
Content Authoring
Content Delivery
The solution is hosted in Azure Web Apps, and content delivery can be scaled up or down from 1 to N instances at any time.
I'm planning to configure the following:
Only CA has onPublishEndAsync.
The CDs will not have any indexing strategy.
I'd appreciate it if you could suggest a solution that has worked for you. Also, how do we disable an indexing strategy?
Thanks.
Usually, when you use replication in Solr (master + slave Solr servers), it should be configured like this:
Content Authoring (CM server):
connects to the Solr master server.
It runs the syncMaster strategy for the master database and onPublishEndAsync for the web database.
Content Delivery (CD servers):
connects to the Solr slave server (or to a load balancer if there are multiple Solr slave servers).
has all the indexing strategies set to manual - CD servers should NEVER update the slave Solr servers (see the config sketch after this answer).
With this setup, the CD servers can always get results from Solr, even while a full index rebuild is in progress (the rebuild happens on the master Solr server, and the data is copied to the slaves after it finishes).
You should also think about having two slave Solr servers behind a load balancer. If you do this:
If the Solr master is down for some reason, the slaves still answer requests from the CD boxes. You can safely restart the master and reindex; the only thing you lose is that search results on CD are not 100% up to date for a while.
If one of the slave Solr servers is down, the second slave still answers requests, and the load balancer redirects all traffic to the slave that works.
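As for disabling the strategies on CD: a patch file along these lines forces an index onto the manual strategy. This is a sketch, not verbatim config; the exact element nesting and the index id (sitecore_web_index here) depend on your Sitecore version:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <configuration>
        <indexes>
          <index id="sitecore_web_index">
            <strategies hint="list:AddStrategy">
              <!-- manual: the index is only touched when rebuilt explicitly -->
              <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/manual" />
            </strategies>
          </index>
        </indexes>
      </configuration>
    </contentSearch>
  </sitecore>
</configuration>

Deploy a patch like this only to the CD roles, so CA keeps its syncMaster/onPublishEndAsync behaviour.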
Under what conditions does Solr start replicating the index from scratch? We have noticed that, in our master/slave setup, Solr periodically starts replicating the entire index from the beginning.
We have not made any changes to the schema or config files; in spite of that, full replication gets triggered. How can this be avoided?
Regards,
Ayush
I wonder whether there is a proper way to keep Solr documents in sync with database records. I usually run into this problem: there are Solr documents for which the referenced database records no longer exist. It seems some DB records have been deleted, but nothing triggered an update to Solr. I want to write a rake task that runs periodically to remove such documents from Solr.
Any suggestions?
Chamnap
Yes, there is one.
You can use the DataImportHandler with its delta-import feature.
Basically, you specify a query that picks up only the rows that have been modified, instead of rebuilding the whole index. Here's an example.
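A minimal sketch of a data-config.xml using delta import; the table and column names (item, id, name, last_modified) and the connection settings are placeholders:

<dataConfig>
  <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost/mydb" user="solr" password="secret"/>
  <document>
    <!-- query: full import; deltaQuery: ids of rows changed since the last run;
         deltaImportQuery: re-fetches each changed row by primary key -->
    <entity name="item" pk="id"
            query="SELECT id, name FROM item"
            deltaQuery="SELECT id FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, name FROM item WHERE id = '${dih.delta.id}'">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>

You trigger it by sending command=delta-import to the /dataimport handler. Note that delta import on its own only covers inserts and updates; for deletions DIH has a deletedPkQuery attribute, which needs the deleted rows to still be discoverable somehow (for example via the soft-delete column mentioned further down).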
Otherwise, you can add a feature to your application that simply triggers the removal of the documents via HTTP in both your DB and your index.
I'm using Java + Java DB + Lucene (which Solr is based on) for my text search and database records. My solution is to back up and then recreate (delete + create) the Lucene index to sync it with my records in Java DB. This seems to be the easiest approach; the only problem is that it's not advisable to run it often. It also means that your records are not updated in real time. I run my batch job nightly so that all changes are reflected the next day. Hope this helps.
Also read the article about syncing Solr and DB records here, under "No synchronization". It states that it's not easy, but possible in some cases. It would be helpful if you specified your programming language so more people can help you.
In addition to the above, "soft" deletion by setting a deleted or deleted_at column is a great approach. That way you can run a script periodically to clear deleted records out of your Solr index as needed.
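A minimal sketch of such a cleanup, assuming a Postgres database with a deleted_at column and plain curl against Solr; the core, table, and field names are placeholders:

# collect the ids of soft-deleted rows
IDS=$(psql -At -c "SELECT string_agg(id::text, ' ') FROM items WHERE deleted_at IS NOT NULL" mydb)
# remove the matching documents from Solr and commit; guard against an empty list
if [ -n "$IDS" ]; then
  curl "http://localhost:8983/solr/mycore/update?commit=true" -H 'Content-Type: text/xml' --data-binary "<delete><query>id:(${IDS})</query></delete>"
fi

The delete-by-query removes every Solr document whose id matches a soft-deleted row; the guard keeps you from sending a malformed id:() query when there is nothing to clean up.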
You mention using a rake task — is this a Rails app you're working with? Most Solr clients for Rails apps should support deleting records via an after_destroy hook.