Sync solr documents with database records - solr

I wonder whether there is a proper way to sync Solr documents with database records. I regularly run into this problem: Solr documents exist whose referenced database records no longer do. It seems some DB records were deleted without anything triggering an update to Solr. I want to write a rake task that runs periodically and removes such documents from Solr.
Any suggestions?
Chamnap

Yes, there is one.
You have to use the DataImportHandler with the delta import feature.
Basically, you specify a query that updates only the rows that have been modified, instead of rebuilding the whole index. Here's an example.
Otherwise, you can add a feature to your application that removes the record from your DB and triggers the removal of the matching document from your index via HTTP.
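For the periodic cleanup the question describes, the core step is a set difference between the ids in the index and the ids still in the database. Below is a minimal Python sketch of that step (the logic ports directly to a rake task). The sample ids and the helper names find_orphan_ids and build_delete_payload are made up for illustration; the resulting body uses the delete-by-id form of Solr's JSON update format and would be POSTed to the core's /update handler with a JSON content type, followed by a commit.

```python
import json


def find_orphan_ids(solr_ids, db_ids):
    """Return ids that are indexed in Solr but no longer exist in the database."""
    return sorted(set(solr_ids) - set(db_ids))


def build_delete_payload(orphan_ids):
    """Build the JSON body for Solr's update handler that deletes documents by id."""
    return json.dumps({"delete": list(orphan_ids)})


# Hypothetical sample data: ids currently in the index vs. ids still in the DB.
solr_ids = ["1", "2", "3", "4"]
db_ids = ["1", "3"]

orphans = find_orphan_ids(solr_ids, db_ids)
payload = build_delete_payload(orphans)
```

For a large index, the Solr ids would have to be fetched in pages rather than all at once, so treat this as the shape of the task rather than a finished script.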

I'm using Java + Java DB + Lucene (which Solr is built on) for text search and database records. My solution is to back up and then recreate (delete + create) the Lucene index so that it is in sync with my records in Java DB. This seems to be the easiest approach; the only problem is that it is not advisable to run it often, which also means the index is not updated in real time. I run my batch job nightly, so all changes show up the next day. Hope this helps.
Also read the article about syncing Solr and DB records here, under "No synchronization". It states that syncing is not easy, but possible in some cases. It would help if you specified your programming language so that more people can help you.

In addition to the above, "soft" deletion by setting a deleted or deleted_at column is a great approach. That way you can run a script to periodically clear out deleted records from your Solr index as needed.
You mention using a rake task — is this a Rails app you're working with? Most Solr clients for Rails apps should support deleting records via an after_destroy hook.
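A rough sketch of the soft-deletion idea, in Python with an in-memory SQLite database standing in for the real one. The table name records and its columns are assumptions made for the example; only the deleted_at convention comes from the answer above. The script collects the ids of rows flagged as deleted, which is the list a periodic job would then ask Solr to remove.

```python
import sqlite3


def soft_deleted_ids(conn):
    """Return ids of rows marked as deleted via a non-NULL deleted_at column."""
    rows = conn.execute(
        "SELECT id FROM records WHERE deleted_at IS NOT NULL ORDER BY id"
    ).fetchall()
    return [str(row[0]) for row in rows]


# In-memory database standing in for the application's real one (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO records (id, title, deleted_at) VALUES (?, ?, ?)",
    [(1, "kept", None), (2, "removed", "2024-01-01T00:00:00"), (3, "kept too", None)],
)

ids_to_remove = soft_deleted_ids(conn)
```

Because the rows stay in the table, the job can be rerun safely if a Solr request fails; the rows can be purged from the DB once the index has caught up.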

Related

Problems with Solr in CKAN

I have a problem with Solr and CKAN.
I understand that Solr is not directly linked to PostgreSQL; the Solr index is maintained by the CKAN code itself.
I've lost all of Solr's data because it is broken, so now I can't run queries in Solr. How can I recover all the data in Solr?
Is there a crawling method that can help? Or is it enough to dump my CKAN database and export/import it again?
You can use the search-index command of CKAN's CLI to rebuild the Solr index:
Rebuilds the search index. This is useful to prevent search indexes from getting out of sync with the main database.
For example:
paster --plugin=ckan search-index rebuild --config=/etc/ckan/std/std.ini
This default behaviour will clear the index and rebuild it with all datasets. If you want to rebuild it for only one dataset, you can provide a dataset name:
paster --plugin=ckan search-index rebuild test-dataset-name --config=/etc/ckan/std/std.ini

hbase-indexer+Phoenix: hbase replication not working?

I have a cluster with HBASE+Phoenix.
I've installed SOLR on it.
Now I'm trying to set up hbase replication for the cluster, following this manual:
https://community.hortonworks.com/articles/1181/hbase-indexing-to-solr-with-hdp-search-in-hdp-23.html
I started the hbase-indexer server, added the hbase-indexer, put data via the hbase shell, and requested a commit via the browser.
But there are no changes in the Solr collection: zero new records.
The status 'replication' command in the hbase shell shows sizeOfLogQueue increasing with each PUT to the indexed table.
When grepping the HBase log (hbase-hbase-regionserver-myserver.log), I found lots of records like this:
Indexer_hbaseindexer: Total replicated edits: 0, currently replicating
from:
hdfs://HDP-Test/apps/hbase/data/WALs/myserver,16020,1519204674681/myserver%2C16020%2C1519204674681.default.1519204995372
at position: 45671433
The position here never changes.
The author of the issue at this link says that HBase replication stops when the WAL codec is changed to IndexedWALEditCodec.
Is it true that IndexedWALEditCodec stops HBase replication from working correctly? That shouldn't be the case.
What could the problem be then? Any hint would be appreciated.
env:
HDFS 2.7.3
HBASE 1.1.2
SOLR 5.5.2
HBASE INDEXER 2.2.8
P.S. After restarting HBase and then requesting a Solr commit, the changes appear. But after that, it does nothing again.

About Solr document not to be committed based on condition

I am trying to find a way to block a Solr commit through the Solr API based on a certain condition.
Currently, every indexed Solr document has a unique id. How can I change the code below so that it does not commit to the Solr index if the id is already present?
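import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
// Build the document; indexUrl serves as the unique id.
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("id", indexUrl);
solrDoc.addField("price", 100);
// Send the document to Solr, then commit to make it searchable.
HttpSolrServer server = new HttpSolrServer(endpoint);
UpdateResponse response = server.add(solrDoc);
server.commit();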
Thanks
You're mistaking Solr for an SQL server... Solr commit is nothing like SQL commit.
The idea behind Solr commit is to lower the amount of writes to the disk. Solr is not transactional... you don't have a rollback ability - except maybe erasing the tlog folder manually. You'll need to rewrite the commit feature entirely to do what you want.
You can query Solr for the ids before sending the documents, but you have no ACID guarantee that, by the time they are sent and committed, other documents with the same id won't already be in the index.
You could maybe get such a guarantee by using a shared SQL server to generate the ids for you, or by using ZooKeeper to do the same (although that is harder to configure).
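One more option worth checking against your Solr version: Solr's optimistic concurrency lets a document carry a _version_ field, and a negative value means "this document must not exist yet". Solr is then expected to reject the add with a version conflict (HTTP 409) instead of overwriting. This works per document at add time, not at commit time, and it assumes the update log is enabled and the schema has a _version_ field. The Python sketch below only builds the JSON body for the /update handler; the helper name and the sample id are made up for illustration.

```python
import json


def build_create_only_payload(doc_id, fields):
    """Build a JSON update body asking Solr to add the document only if its id is new."""
    # A negative _version_ tells Solr the document must not already exist.
    doc = {"id": doc_id, **fields, "_version_": -1}
    return json.dumps([doc])


# Hypothetical document mirroring the fields from the question.
payload = build_create_only_payload("http://example.com/item/1", {"price": 100})
```

The client still has to handle the conflict response, for example by skipping the document, which is the "do not index if the id is already present" behaviour asked for.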

Should I Use Apache Solr

I am using Oracle 11g2 with a schema in which a few tables have more than 100 million rows (some of the columns are VARCHAR2, 100 bytes). We frequently have to run LIKE-based searches on those tables, and sometimes we need to join the tables as well. Inserts and updates on these tables also happen very frequently (1000 inserts/updates per second), made by other applications.
So my question is: for my user interface, should I use Apache Solr to let users search these tables instead of SQL queries? I have tried SQL and it is really slow (considering the amount of data in my database).
My requirements are:
Results should come back fast and be accurate.
Results should reflect the latest data.
Can you suggest whether I should go with Apache Solr or another solution for my problem?
EDIT: Should I consider Cassandra? Does it support joins between tables? Does it support LIKE-based search on VARCHAR columns?
You should consider using Hibernate Search for your requirement. You can use this link to get started. As your application is table-based, you get ORM facilities as well as full-text search through this module. If you want to use Apache Solr, you may have to write the code needed to sync data from your database to Apache Solr.
Hope this helps.
Thanks
I think you can use Solr in your case, and you will get results quickly and accurately. As for the latest data, there will be some lag, since data needs to be added or updated in the Solr index.
I have used Solr in my application (with the DataImportHandler).
I update the Solr index every 20 minutes, which means the latest database updates can lag by up to 20 minutes.
But users are fine with that, as it's faster than Oracle search :)
It's been three years; it's working fine, with no issues.

Solr Cloud: Inconsistent Result

We are using Solr Cloud (4.3) for indexing data. We have 2 shard/2 replica servers in Solr Cloud.
We tried executing the query on each individual shard, and it shows correct results.
When we execute the same query (*:*) from the Solr Admin Console, it displays inconsistent results (the number of records found is different each time).
What could be wrong? How can we troubleshoot it?
How is a query executed across the different shards/replicas, and how are the results combined? Is there any document that explains this in detail?
I believe you have to make sure that Solr is doing soft commits, so that newly indexed documents become visible on all replicas. Set the interval according to how "current" you need the data to stay:
solr.autoSoftCommit.maxDocs=<max number of uncommitted documents before soft commit>
solr.autoSoftCommit.maxTime=<max time in ms before soft commit>
http://wiki.apache.org/solr/SolrConfigXml
SOLR autoCommit vs autoSoftCommit
Do a commit operation on SolrCloud after you index your data, then refresh your results. The first one or two times it might show you different results, but after that it should be pretty consistent.
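To troubleshoot, it can help to ask each replica for its own count with distrib=false, which keeps Solr from fanning the query out to other shards, and then compare replicas of the same shard. Here is a small Python sketch of that comparison; the core URL, the shard and replica names, and the numFound values are made-up placeholders.

```python
from urllib.parse import urlencode


def core_count_url(core_base_url, query="*:*"):
    """Build a select URL that counts matches on one core only (no fan-out)."""
    params = {"q": query, "rows": 0, "distrib": "false", "wt": "json"}
    return f"{core_base_url}/select?{urlencode(params)}"


def shards_with_mismatch(counts):
    """Given {shard: {replica: numFound}}, return shards whose replicas disagree."""
    return sorted(
        shard for shard, replicas in counts.items()
        if len(set(replicas.values())) > 1
    )


# Hypothetical numFound values collected by requesting core_count_url() per replica.
counts = {
    "shard1": {"replica1": 1200, "replica2": 1200},
    "shard2": {"replica1": 980, "replica2": 975},
}
mismatched = shards_with_mismatch(counts)
url = core_count_url("http://localhost:8983/solr/collection1_shard2_replica1")
```

If replicas of one shard disagree, the varying totals are explained by which replica happens to serve each request, and the commit settings above are the first thing to look at.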

Resources