Solr Cloud: Inconsistent Results

We are using Solr Cloud (4.3) for indexing data. We have a 2-shard / 2-replica setup in Solr Cloud.
When we execute a query against an individual shard, it shows correct results.
When we execute the same match-all query (*:*) from the Solr Admin Console, it displays inconsistent results: the number of records found is different each time.
What could be wrong? How can we troubleshoot it?
How is a query executed across the different shards/replicas, and how are the results combined? Is there a document that explains this in detail?

I believe you have to make sure that Solr is doing soft commits to push information to the other replicas. The soft-commit frequency needs to be set to how "current" you need the data to stay:
solr.autoSoftCommit.maxDocs=<max number of uncommitted documents before a soft commit>
solr.autoSoftCommit.maxTime=<max time in ms before a soft commit>
http://wiki.apache.org/solr/SolrConfigXml
SOLR autoCommit vs autoSoftCommit
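For reference, these properties plug into the <updateHandler> section of solrconfig.xml. A minimal sketch, assuming illustrative values (1 s soft commits for visibility, 60 s hard commits for durability):
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- soft commit: makes new documents visible to searches -->
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
    <maxDocs>${solr.autoSoftCommit.maxDocs:10000}</maxDocs>
  </autoSoftCommit>
  <!-- hard commit: flushes to stable storage without opening a new searcher -->
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>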

Do a commit operation on Solr Cloud after you index your data, then refresh your results. One or two times it might still show different results, but after that it should be pretty consistent.
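To illustrate, a minimal SolrJ (4.x) sketch of indexing followed by an explicit commit; the ZooKeeper address and collection name are assumptions:
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitExample {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper so updates are routed to the shard leaders
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181/solr");
        server.setDefaultCollection("mycollection");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        server.add(doc);
        // Hard commit; server.commit(true, true, true) would issue a soft commit instead
        server.commit();
        server.shutdown();
    }
}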

Related

hbase-indexer+Phoenix: hbase replication not working?

I have a cluster with HBASE+Phoenix.
I've installed SOLR on it.
Now I'm trying to set up hbase replication for the cluster, following this manual:
https://community.hortonworks.com/articles/1181/hbase-indexing-to-solr-with-hdp-search-in-hdp-23.html
Started the hbase-indexer server, added an hbase-indexer, put data via the hbase shell, and requested a commit via the browser.
But there are no changes in the SOLR collection - zero new records.
The status 'replication' command in the hbase shell shows sizeOfLogQueue increasing with each PUT to the indexed table.
When grepping the hbase log (hbase-hbase-regionserver-myserver.log) I found lots of records like this:
Indexer_hbaseindexer: Total replicated edits: 0, currently replicating
from:
hdfs://HDP-Test/apps/hbase/data/WALs/myserver,16020,1519204674681/myserver%2C16020%2C1519204674681.default.1519204995372
at position: 45671433
The position here never changes.
The author of the issue at this link says that hbase replication stops when the WAL codec is changed to IndexedWALEditCodec.
Is it really the case that IndexedWALEditCodec stops hbase replication from working correctly? That shouldn't be true.
What might the problem be then? Any hint would be appreciated.
env:
HDFS 2.7.3
HBASE 1.1.2
SOLR 5.5.2
HBASE INDEXER 2.2.8
P.S. When I restart HBase and then request a Solr commit, the changes appear; but after that, nothing further gets indexed.
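(For reference, the "requested a commit via the browser" step above is just Solr's update endpoint with the commit flag; the host and collection name here are assumptions:)
http://myserver:8983/solr/hbase-collection1/update?commit=true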

How to handle solr replication when master goes down

I have a Solr setup configured for master and slave. Indexing happens on the master, and the slave replicates the index from the master at a 2-minute interval, so there is a delay of up to 2 minutes in getting data from master to slave. Let's assume my master was indexing some data at 10:42 but went down at 10:43 due to a hardware issue. The data indexed at 10:42 was supposed to replicate to the slave by 10:44 (given the two-minute interval). Since the master is no longer available, how do I identify the last data indexed on the Solr master? Is there a way to track indexing activity in the Solr log?
Thanks in Advance
Solr does log indexing operations if you have the Solr log level set to INFO. Any commit/add will show up in the log, so you can check the log for when the last addition was made. Depending on the setup, it might be hard to get at the log when the server is down, though.
You can reduce the time between replications to get more real time replication, or use SolrCloud instead (which should distribute the documents as they're being indexed).
There are also API endpoints (see which connections the Admin interface makes when browsing to the 'replication' status page) for getting the replication status, but those wouldn't help you if the server is gone.
In general - if the server isn't available, you'll have a hard time telling when it was last indexed to. You can work around a few of the issues by storing the indexing time outside of Solr from the indexing task, for example updating a value in memcache or MySQL every time you send something to be indexed from your application.
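To illustrate that workaround, a minimal SolrJ sketch that records a timestamp outside Solr every time something is sent for indexing; the file path stands in for the memcache/MySQL write, and all names are assumptions:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TrackedIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer master = new HttpSolrServer("http://master:8983/solr/core1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        master.add(doc);
        // Persist the last-indexed time where a master crash cannot take it down
        String stamp = String.valueOf(System.currentTimeMillis());
        Files.write(Paths.get("/shared/last_indexed"), stamp.getBytes(StandardCharsets.UTF_8));
    }
}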

How to avoid committing a Solr document based on a condition

I am trying to find a way to block a Solr commit via the Solr API based on a certain condition.
Currently, every document in the Solr index has a unique id. How can I change my code below so that it does not commit the document to the Solr index if that id is already present?
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
// Build the document and send it to Solr
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("id", indexUrl);
solrDoc.addField("price", 100);
HttpSolrServer server = new HttpSolrServer(endpoint);
UpdateResponse response = server.add(solrDoc);
server.commit(); // hard commit makes the add visible
Thanks
You're mistaking Solr for a SQL server... a Solr commit is nothing like a SQL commit.
The idea behind the Solr commit is to reduce the number of writes to disk. Solr is not transactional... you don't have rollback ability - except maybe by erasing the tlog folder manually. You would need to rewrite the commit feature entirely to do what you want.
You can query Solr for the ids before sending the documents, but you have no ACID guarantee that by the time they are sent and committed there won't be other documents with the same id already in the index.
You could maybe get such a guarantee by using a shared SQL server to generate IDs for you, or by using ZooKeeper to do the same (although that is harder to configure).
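To illustrate the query-first approach, a minimal SolrJ sketch; the endpoint and field names are assumptions, and the race window described above still exists between the check and the add:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddIfAbsent {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/core1");
        String id = "http://example.com/page-1";
        // Check whether a document with this id is already indexed
        SolrQuery q = new SolrQuery("id:\"" + id + "\"");
        q.setRows(0); // we only need the count, not the documents
        long found = server.query(q).getResults().getNumFound();
        if (found == 0) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("price", 100);
            server.add(doc);
            server.commit(); // not atomic with the check above
        }
        server.shutdown();
    }
}
Note that Solr 4.x also offers optimistic concurrency: adding a document with a _version_ field of -1 is rejected if a document with that id already exists (this requires the update log to be enabled).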

Should I Use Apache Solr

I am using Oracle 11g2, and we have a schema where a few tables have more than 100 million rows (some columns are VARCHAR2(100 BYTE)). We frequently have to do LIKE-based searches on those tables, and sometimes we need to join them as well. Inserts/updates also happen very frequently on these tables (1000 inserts/updates per second) from other applications.
So my question is: for my user interface, should I use Apache Solr to let users search these tables instead of using SQL queries? I have tried SQL and it is really slow (considering the amount of data in my database).
My requirements are:
Results should come back fast and be accurate.
They should reflect the latest data.
Can you suggest whether I should go with Apache Solr, or another solution for my problem?
EDIT: Should I consider Cassandra? Does it support joins between tables? Does it support LIKE-based searches on VARCHAR columns?
You should consider using Hibernate Search for your requirement; you can use this link to get started. Since you are already working with database tables, you get facilities like ORM as well as full-text search through this module. If you want to use Apache Solr instead, you will have to write the code that synchronizes data from your database to Apache Solr yourself.
Hope this helps.
Thanks
I think you can use Solr in your case, and you will get results faster and accurately. As for having the latest data, there will be some lag, since data still needs to be added/updated in the Solr index. I have used Solr in my application (with the DataImportHandler), updating the Solr index every 20 minutes, which means a 20-minute lag behind the latest update in the database. But users are fine with it, as it's faster than the Oracle search :) It's been three years now; it's working fine and there has been no issue with it.

Sync solr documents with database records

I wonder if there is a proper way to keep Solr documents in sync with database records. I usually have this problem: there are Solr documents whose referenced database records no longer exist. It seems some DB records have been deleted, but no trigger has run to update Solr. I want to write a rake task that runs periodically and removes such documents from Solr.
Any suggestions?
Chamnap
Yes, there is one.
You have to use the DataImportHandler with the delta import feature.
Basically, you specify a query that selects only the rows that have been modified since the last import, instead of rebuilding the whole index. Here's an example.
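A minimal data-config.xml sketch of a delta import, assuming a hypothetical item table with id, name, and last_modified columns:
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="solr" password="secret"/>
  <document>
    <!-- query drives full imports; deltaQuery/deltaImportQuery pick up changed rows only -->
    <entity name="item"
            query="SELECT id, name FROM item"
            deltaQuery="SELECT id FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, name FROM item WHERE id = '${dih.delta.id}'"/>
  </document>
</dataConfig>
You would then trigger it with /dataimport?command=delta-import instead of a full import.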
Otherwise you can add a feature to your application that simply triggers the removal of the documents via HTTP in both your DB and your index.
I'm using Java + Java DB + Lucene (which Solr is based on) for my text search and database records. My solution is to back up and then recreate (delete + create) the Lucene index to sync it with my records in Java DB. This seems to be the easiest approach; the only problem is that it is not advisable to run often. It also means that your records are not updated in real time. I run my batch job nightly, so all changes show up the next day. Hope this helps.
Also read an article about syncing Solr and DB records here, under "No synchronization". It states that it's not easy, but possible in some cases. It would also be helpful if you specified your programming language so more people can help you.
In addition to the above, "soft" deletion by setting a deleted or deleted_at column is a great approach. That way you can run a script to periodically clear out deleted records from your Solr index as needed.
You mention using a rake task — is this a Rails app you're working with? Most Solr clients for Rails apps should support deleting records via an after_destroy hook.
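To illustrate, a minimal Java sketch of such a periodic cleanup, deleting from Solr the ids marked deleted in the database; the table, column, and connection details are assumptions:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SolrCleanup {
    public static void main(String[] args) throws Exception {
        // Collect the ids of records soft-deleted in the database
        List<String> deletedIds = new ArrayList<String>();
        Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
        Statement st = db.createStatement();
        ResultSet rs = st.executeQuery("SELECT id FROM items WHERE deleted_at IS NOT NULL");
        while (rs.next()) {
            deletedIds.add(rs.getString("id"));
        }
        db.close();
        // Remove the matching documents from the index
        if (!deletedIds.isEmpty()) {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core1");
            solr.deleteById(deletedIds);
            solr.commit();
            solr.shutdown();
        }
    }
}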
