How to find the delta between two Solr collections

We are using LucidWorks Solr version 4.6.
Our source system stores data into two destination systems (one in real time and the other through batch mode). Data is ingested into Solr via the real-time route.
We need to periodically sync the data ingested into Solr with the data ingested into the batch system.
The design we are currently evaluating is to import the data from the batch system into another Solr collection, but we are not sure how to sync the two collections (i.e. the one with real-time data and the one populated through batch import).
I read about the Data Import Handler, but it will overwrite the existing data in Solr. Is there any way we can identify the delta between the two collections and ingest only that?

There is no good way; there are a couple of things you can do:
When data comes into the real-time system, add an import timestamp, then do a range query to pull in only the new documents (see the sketch below). I think newer versions of Solr already have a field for this.
Log the IDs of documents going into the first Solr collection and then index only those.
Use a separate queue for the other collection.
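A minimal sketch of the timestamp-range option, assuming a reasonably recent SolrJ (the 4.x client used HttpSolrServer instead of HttpSolrClient); the collection URLs and the indexed_at field are placeholders:

import java.time.Instant;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class DeltaSync {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs: the real-time collection is the source, the batch-fed one the target.
        try (SolrClient realtime = new HttpSolrClient.Builder("http://localhost:8983/solr/realtime").build();
             SolrClient batch = new HttpSolrClient.Builder("http://localhost:8983/solr/batch").build()) {

            // Last successful sync time; in practice this would be persisted between runs.
            Instant lastSync = Instant.parse("2024-01-01T00:00:00Z");

            // Range query on the hypothetical indexed_at field to fetch only new documents.
            SolrQuery query = new SolrQuery("indexed_at:[" + lastSync + " TO NOW]");
            query.setRows(1000);   // page with cursorMark for anything beyond a small delta

            List<SolrDocument> delta = realtime.query(query).getResults();

            // Copy the delta into the batch-fed collection.
            for (SolrDocument doc : delta) {
                SolrInputDocument copy = new SolrInputDocument();
                doc.getFieldNames().forEach(f -> copy.addField(f, doc.getFieldValue(f)));
                copy.removeField("_version_");   // let the target assign its own version
                batch.add(copy);
            }
            batch.commit();
        }
    }
}

The key design choice is that the timestamp (or the logged IDs) gives each side a cheap way to answer "what changed since the last sync" instead of diffing two full collections.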

Related

Reading and Processing Millions of documents

I have 1 million+ documents stored in cloud storage, and I want to annotate these documents with an application and then store the results in a DB. The data is loaded once rather than in real time, and it might be updated once a week.
My approach: what I have done and am trying to implement is:
List all the documents and store their paths in MongoDB with a flag marking them for processing
Write some code to read the files that are flagged for processing
Get the result for each file and store it in MongoDB
Read the documents from the DB directly
This approach is very slow and takes a lot of effort.
My question is: how can I speed up the process of listing all the documents, cleaning them, passing them to the application for annotation, and then storing the results in the DB?
I also want to explore whether other tools could reduce the coding effort needed for this process.
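As one hedged illustration of the flag-based loop described above, using the MongoDB Java driver; the database, collection, and field names (annotations, docs, path, processed) are placeholders, and annotate() stands in for the external annotation application:

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.combine;
import static com.mongodb.client.model.Updates.set;

public class AnnotationWorker {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> docs = client.getDatabase("annotations").getCollection("docs");

            // Pull every document still flagged for processing.
            for (Document entry : docs.find(eq("processed", false))) {
                String path = entry.getString("path");

                // annotate() is a placeholder for the external annotation application.
                String result = annotate(path);

                // Store the result and clear the flag in a single update.
                docs.updateOne(eq("_id", entry.get("_id")),
                        combine(set("annotation", result), set("processed", true)));
            }
        }
    }

    private static String annotate(String path) {
        return "annotation for " + path;   // stand-in for the real annotation call
    }
}

Most of the speed-up usually comes from running several such workers in parallel and letting each one claim documents atomically (e.g. with findOneAndUpdate) rather than from the listing step itself.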

Transform a specific group of records from MongoDB

I've got a periodically triggered batch job that writes data into MongoDB. The job takes about 10 minutes, and after that I would like to read the data and do some transformations with Apache Flink (mapping, filtering, cleaning...). There are dependencies between the records, which means I have to process them together. For example, I'd like to transform all records from the latest batch job where the customer id is 45666; the result would be one aggregated record.
Are there any best practices or ways to do that without implementing everything myself (get the distinct customer ids from the latest job, select and transform the records for each customer, flag the transformed customers, etc.)?
I'm not able to stream the records because I have to transform multiple records together, not one by one.
Currently I'm using Spring Batch, MongoDB, and Kafka, and I'm thinking about Apache Flink.
Conceivably you could connect the MongoDB change stream to Flink and use that as the basis for the task you describe. The fact that 10-35 GB of data is involved doesn't rule out using Flink streaming, as you can configure Flink to spill to disk if its state can't fit on the heap.
I would want to understand the situation better before concluding that this is a sensible approach, however.
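If Flink does end up in the picture, here is a minimal sketch of keying by customer id and collapsing each customer's records into one aggregate; the CustomerRecord POJO and the in-memory sample data are placeholders for whatever source (Kafka, a change-stream connector) actually feeds the job:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CustomerAggregationSketch {

    // Simple POJO standing in for a record written by the batch job.
    public static class CustomerRecord {
        public long customerId;
        public double amount;

        public CustomerRecord() { }   // Flink POJOs need a no-arg constructor
        public CustomerRecord(long customerId, double amount) {
            this.customerId = customerId;
            this.amount = amount;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In-memory sample data; in practice this would come from Kafka or a MongoDB connector.
        DataStream<CustomerRecord> records = env.fromElements(
                new CustomerRecord(45666, 10.0),
                new CustomerRecord(45666, 2.5),
                new CustomerRecord(12345, 7.0));

        records
                .keyBy(r -> r.customerId)   // all records of one customer are processed together
                .reduce((a, b) -> new CustomerRecord(a.customerId, a.amount + b.amount))
                .print();                   // emits a running aggregate per customer

        env.execute("customer-aggregation-sketch");
    }
}

A real job would bound each batch run (for example with an event-time window or a marker record) so that the aggregate is emitted once per run rather than as a running value.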

What's the simplest way to process lots of updates to Solr in batches?

I have a Rails app that uses Sunspot, and it is generating a high volume of individual updates which are generating unnecessary load on Solr. What is the best way to send these updates to Solr in batches?
Assuming the changes from the Rails app also update a persistent store, you can look at the Data Import Handler (DIH), which can be scheduled to update the Solr indexes periodically.
So instead of every update triggering a commit on Solr, you can choose the frequency at which Solr is updated in batches.
However, expect some latency in the search results.
Also, are you updating individual records and committing each time? If you are on Solr 4.0, you can look at soft and hard commits as well (see the snippet below).
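As a hedged illustration of leaving commits to the server rather than committing on every update, SolrJ lets you attach a commitWithin deadline to a whole batch of adds; the URL and field names below are placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            List<SolrInputDocument> docs = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_txt", "document " + i);
                docs.add(doc);
            }
            // One request for the whole batch; Solr commits within 10 seconds
            // instead of the client issuing a commit after every single update.
            solr.add(docs, 10_000);
        }
    }
}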
Sunspot makes indexing a batch of documents pretty straightforward:
Sunspot.index(array_of_docs)
That will send off just the kind of batch update to Solr that you're looking for here.
The trick for your Rails app is finding the right scope for those batches of documents. Are they being created as the result of a bunch of user requests, and scattered all around your different application processes? Or do you have some batch process of your own that you control?
The sunspot_index_queue project on GitHub looks like a reasonable approach to this.
Alternatively, you can always turn off Sunspot's "auto-index" option, which fires off updates whenever your documents are updated. In your model, you can pass in auto_index: false to the searchable method.
searchable auto_index: false do
# sunspot setup
end
Then you have a bit more freedom to control indexing in batches. You might write a standalone Rake task which iterates through all objects created and updated in the last N minutes and index them in batches of 1,000 docs or so. An infinite loop of that should stand up to a pretty solid stream of updates.
At a really large scale, you really want all your updates going through some kind of queue. Inserting your document data into a queue like Kafka or AWS Kinesis for later processing in batches by another standalone indexing process would be ideal for this at scale.
I used a slightly different approach here:
I was already using auto_index: false and processing Solr updates in the background with Sidekiq. So instead of building an additional queue, I used the sidekiq-grouping gem to combine Solr update jobs into batches. Then I use Sunspot.index in the job to index the grouped objects in a single request.

Running solr index on hadoop

I have a huge amount of data that needs to be indexed, and it took more than 10 hours to get the job done. Is there a way I can do this on Hadoop? Has anyone done this before? Thanks a lot!
You haven't explained where the 10 hours go. Is the time spent extracting the data, or just indexing it?
If most of the time is spent on extraction, then Hadoop can help. Solr supports bulk adds, so in your map function you could accumulate thousands of records and send them to Solr for indexing in one shot. That will improve your performance a lot.
Also, what size is your data?
You could collect a large number of records in the reduce function of a map/reduce job. You have to generate suitable keys in your map function so that many records go to a single reducer. In your custom reduce class, initialize the Solr client in the setup/configure method (depending on your Hadoop version) and close it in the cleanup method. You will have to build a collection of document objects (in SolrNet or SolrJ) and commit them all in one shot; a sketch follows.
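A minimal sketch of that reducer, assuming the newer org.apache.hadoop.mapreduce API and SolrJ; the Solr URL, batch size, and field mapping are placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

    private static final int BATCH_SIZE = 1000;     // tune to your heap and Solr capacity

    private SolrClient solr;                        // one client per reducer task
    private final List<SolrInputDocument> buffer = new ArrayList<>();
    private long docCount = 0;

    @Override
    protected void setup(Context context) {
        // Placeholder URL: point this at your real Solr core/collection.
        solr = new HttpSolrClient.Builder("http://solr-host:8983/solr/mycollection").build();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException {
        for (Text value : values) {
            SolrInputDocument doc = new SolrInputDocument();
            // Placeholder field mapping: parse your real record format here.
            doc.addField("id", key.toString() + "-" + (docCount++));
            doc.addField("body_txt", value.toString());
            buffer.add(doc);
            if (buffer.size() >= BATCH_SIZE) {
                flush();
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        flush();                                    // send any remaining documents
        try {
            solr.commit();                          // one commit at the very end
        } catch (Exception e) {
            throw new IOException(e);
        } finally {
            solr.close();
        }
    }

    private void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        try {
            solr.add(buffer);                       // one HTTP request per batch of documents
        } catch (Exception e) {
            throw new IOException(e);
        }
        buffer.clear();
    }
}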
If you are using Hadoop, there is another option called Katta. You can look into it as well.
You can write a map/reduce job over your Hadoop cluster that simply takes each record and sends it to Solr over HTTP for indexing. AFAIK, Solr currently doesn't index over a cluster of machines, so it would be worth looking into Elasticsearch if you also want to distribute your index over multiple nodes.
There is a Solr Hadoop output format that creates a new index in each reducer, so you distribute your keys according to the indices you want and then copy the HDFS files into your Solr instance after the fact.
http://www.datasalt.com/2011/10/front-end-view-generation-with-hadoop/

Indexing SQLServer data with SOLR

What is the best way of syncing database changes with Solr incremental indexing? What is the best way of getting MS SQL Server data indexed by Solr?
Thanks so much in advance.
Solr works with plugins. You will need to create your own data importer plugin that is called periodically (based on notifications, elapsed time, etc.). You will point your Solr configuration at the class that is called on update.
Regarding your second question, I used a text file that holds a timestamp. Each time Solr was started, it looked at that file and retrieved from the DB the data that had changed since that point (the file is updated when the index is updated).
I would suggest reading a good Solr/Lucene guide such as lucidworks-solr-refguide-1.4 before getting started, so you can be sure that your architectural solution is correct.
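A minimal sketch of the timestamp-file approach described above, assuming SolrJ and the Microsoft JDBC driver; the connection string, table, columns, and Solr URL are placeholders, and the last-run handling is deliberately simplified to a single file:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IncrementalSqlServerIndexer {
    public static void main(String[] args) throws Exception {
        Path marker = Path.of("last_index_time.txt");          // file holding the last run's timestamp
        Instant lastRun = Files.exists(marker)
                ? Instant.parse(Files.readString(marker).trim())
                : Instant.EPOCH;
        Instant thisRun = Instant.now();

        // Placeholder connection string, table, and column names.
        String jdbcUrl = "jdbc:sqlserver://localhost:1433;databaseName=mydb;user=sa;password=secret";
        String sql = "SELECT id, title, body, modified_at FROM articles WHERE modified_at > ?";

        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(sql);
             SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {

            stmt.setTimestamp(1, Timestamp.from(lastRun));
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("title_txt", rs.getString("title"));
                    doc.addField("body_txt", rs.getString("body"));
                    solr.add(doc, 30_000);                      // let Solr commit within 30 s
                }
            }
            solr.commit();
        }

        // Only advance the marker once the run has succeeded.
        Files.writeString(marker, thisRun.toString(), StandardCharsets.UTF_8);
    }
}

The Data Import Handler's delta-import does much the same thing with the last_index_time it keeps in dataimport.properties, so it is worth evaluating before writing a custom plugin.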
