Indexing SQL Server data with Solr - sql-server

What is the best way of syncing database changes with Solr's incremental indexing? And what is the best way of getting MS SQL Server data indexed by Solr?
Thanks so much in advance.

Solr works with plugins. You will need to create your own data importer plugin that is called periodically (based on notifications, elapsed time, etc.). You point your Solr configuration at the class that will be called upon update.
Regarding your second question, I used a text file that holds a timestamp. Each time Solr started, it looked at that file and retrieved from the database the data that had changed since that point (the file is updated whenever the index is updated).
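Roughly, that timestamp-file approach could look like the sketch below, assuming a hypothetical tickets table with a last_modified column in SQL Server, the Microsoft JDBC driver, and the SolrJ client. All table, column, and URL names are illustrative, not part of the original answer.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SqlServerDeltaIndexer {
    public static void main(String[] args) throws Exception {
        // File that remembers when the last successful import ran.
        Path marker = Paths.get("last-import.txt");
        String since = Files.exists(marker) ? Files.readString(marker).trim() : "1970-01-01 00:00:00";

        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=helpdesk;user=sa;password=secret");
             HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/tickets").build();
             PreparedStatement ps = con.prepareStatement(
                 "SELECT id, subject, body FROM dbo.tickets WHERE last_modified > ?")) {

            // Record the start time before querying, so rows changed during this run
            // are picked up again (and safely re-indexed) on the next run.
            Timestamp start = new Timestamp(System.currentTimeMillis());
            ps.setTimestamp(1, Timestamp.valueOf(since));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("subject", rs.getString("subject"));
                    doc.addField("body", rs.getString("body"));
                    solr.add(doc);                       // only rows changed since the previous run
                }
            }
            solr.commit();
            Files.writeString(marker, start.toString()); // advance the marker for the next run
        }
    }
}

Run from cron or a scheduler at whatever interval suits your freshness requirements.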
I would suggest reading a good Solr/Lucene book or guide such as lucidworks-solr-refguide-1.4 before getting started, so you can be sure your architectural solution is correct.

Related

cloudant dashdb sync issue

We have created a warehouse with a source database in Cloudant.
We initially ran the schema discovery process on about 40,000 records. Our Cloudant database consists of around 2 million records.
The issue we are now facing is that many records end up in the _OVERFLOW table in dashDB (meaning they were rejected), with an error like "[column does not exist in the discovered schema. Document has not been imported.]"
It seems to me that the issue is that the Cloudant database, which is actually the result of dbcopy, contains partials in the docs. Those partials are created internally by Cloudant with values we can only know after they have been created (such as "40000000-5fffffff"), so they were not discovered by the schema discovery process, and now all docs with undiscovered partials are being rejected by the Cloudant-dashDB sync.
Does anyone have any idea how to resolve this?
The best option for you to resolve this is with a simple trick: feed the schema discovery algorithm exactly one document with the structure you want to create in your dashDB target.
If you can build such a "template" document ahead of time, have the algorithm discover that one and load it into dashDB. With the continuous replication from Cloudant to dashDB you can then have dbcopy load your actual documents into the database that serves as source for your cloudant-dashdb sync.
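As an illustration (all field names here are invented, not from the original question), the "template" would simply be one document that already carries every field the real documents can contain, including any Cloudant-generated partial fields, so that schema discovery creates every column up front:

{
  "_id": "schema-template",
  "order_id": "ORD-0001",
  "status": "open",
  "amount": 0,
  "tags": ["example"],
  "created_at": "2016-01-01T00:00:00Z"
}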
We initially ran the schema discovery process on about 40,000 records.
Our database consists of around 2 million records.
Do all of these 2 million records share the same schema? I believe not.
"[column does not exist in the discovered schema. Document has not been imported.]"
It means that during your initial scan of 40,000 records the application didn't find any document with that field.
Let's say the sequence of documents in your Cloudant db is:
500,000 docs that match schema A
800,000 docs that match schema B
700,000 docs that match schema C
Your discovery process checked just the first 40,000, so it never got to schemas B and C.
I would recommend re-running the discovery process over all 2 million records. It will take time, but it will guarantee that all fields are discovered.

How to find delta between two SOLR collections

We are using LucidWorks Solr version 4.6.
Our source system stores data in two destination systems (one in real time and the other through batch mode). Data is ingested into Solr through the real-time route.
We need to periodically sync the data ingested into Solr with the data ingested into the batch system.
The design we are currently evaluating is to import the data from the batch system into another Solr collection, but we are really not sure how to sync the two collections (i.e., the one with real-time data and the one populated through batch import).
I read through the data import handlers, but that would override the existing data in Solr. Is there any way we can identify the delta between the two collections and ingest only that?
There is no single good way; there are a couple of things you can do:
When data comes into the real-time system there is an import timestamp. Do a range query on it to pull in the new documents (see the sketch after this list). I think newer versions of Solr already have a field for this.
Log the IDs of documents going into the first Solr and then index those.
Use a separate queue for the other collection.
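A rough sketch of the first option with SolrJ, assuming the real-time collection has an index_time date field and that all copied fields are stored; the collection URLs, field names, and the hard-coded last-sync value are illustrative only:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class CollectionDeltaSync {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient realtime = new HttpSolrClient.Builder("http://localhost:8983/solr/realtime").build();
             HttpSolrClient batch = new HttpSolrClient.Builder("http://localhost:8983/solr/batch").build()) {

            // Pull everything indexed in the real-time collection since the last sync.
            String lastSync = "2014-01-01T00:00:00Z";   // would normally be persisted between runs
            SolrQuery q = new SolrQuery("index_time:[" + lastSync + " TO NOW]");
            q.setRows(1000);                             // page with start/rows (or cursorMark) for large deltas

            QueryResponse rsp = realtime.query(q);
            for (SolrDocument d : rsp.getResults()) {
                SolrInputDocument copy = new SolrInputDocument();
                for (String field : d.getFieldNames()) {
                    if (!"_version_".equals(field)) {    // skip Solr's internal version field
                        copy.addField(field, d.getFieldValue(field));
                    }
                }
                batch.add(copy);                         // re-index only the delta into the batch collection
            }
            batch.commit();
        }
    }
}

Note that copying documents this way only recovers stored fields; if some fields are indexed but not stored, you would need to re-fetch them from the source system instead.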

Sunspot with Solr 3.5. Manually updating indexes for real time search

I'm working with Rails 3 and Sunspot Solr 3.5. My application uses Solr to index user-generated content and makes it searchable for other users. The goal is to let users search this data as soon as possible after it has been uploaded. I don't know if this qualifies as real-time search.
My application has two models
Posts
PostItems
I index posts by including data from post items, so that when a user searches based on a description provided in a post_item record, the corresponding post object is returned in the search results.
Users frequently update post_items so every time a new post_item is added I need to reindex the corresponding post object so that the new post_item will be available during search.
So at the moment, whenever I receive a new post_item object, I run
post_item.post.solr_index!
which, according to this documentation, instantly updates the index and commits. This works, but is this the right way to handle indexing in this scenario? I read here that calling index while searching may break Solr. Also, frequent manual index calls are not the way to go.
Any suggestions on the right way to do this? Are there alternatives other than switching to ElasticSearch?
Try using this gem: https://github.com/bdurand/sunspot_index_queue
You will then be able to batch reindex, say, every minute, and it definitely will not break the index.
If you are just starting out and have the luxury to choose between Solr and ElasticSearch, go with ElasticSearch.
We use Solr in production and have run into many weird issues as the index and search volume grew. Our conclusion was that Solr was built and optimized for indexing huge documents (Word/PDF content) in large numbers (billions?), with the index updated once a day or every couple of days while nobody is searching.
It was the wrong choice for a consumer Rails application where documents are small and relatively few (in the millions), updates are random and continuous, and search needs to be somewhat real time (a delay of 5-10 seconds is fine).
Some of the tricks we applied to tune the server:
removed all commits (i.e., the ! calls) from the Rails code,
used Solr auto-commit every 5/20 seconds (see the config sketch at the end of this answer),
set up a master/slave configuration,
ran index optimization (on the master) every hour,
and more.
And we still see high CPU usage on the slaves when a commit triggers. As a result, some searches take a long time (> 60 seconds at times).
I also doubt that the batch-indexing sunspot_index_queue gem can remedy the high CPU issue.
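For reference, the auto-commit interval mentioned in the list above is set in solrconfig.xml. A minimal sketch, with illustrative values and the syntax used by Solr 3.x:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>20000</maxTime>   <!-- commit at most every 20 seconds -->
    <maxDocs>1000</maxDocs>    <!-- or once 1000 documents are pending -->
  </autoCommit>
</updateHandler>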

Few questions about Solr. Transactions and Realtime search

I have a helpdesk application in PHP/MySQL. I want to implement real-time full-text search and I have shortlisted Solr. The MySQL database will store all the data, and the data required for search will be imported to build the Solr index. All search requests will be handled by Solr.
What I want is
Real time search. The moment someone updates a ticket, it should be available for search.
If multiple people update the ticket simultaneously, Solr should be able to handle the commits.
As per my understanding of Solr, this is how I think the system will work: a user updates a ticket -> the corresponding database records are modified -> a request is sent to the Solr server to modify the corresponding document in the index.
I have read a book on Solr and the questions below are troubling me.
The book mentions that
"commits are slow in Solr. Depending on the index size, Solr's
auto-warming configuration, and Solr's cache state prior to
committing, a commit can take a non-trivial amount of time. Typically,
it takes a few seconds, but it can take some number of minutes in
extreme cases"
If this is true, then how will I know when the data will be available for search, and how can I implement real-time search? Even if it takes only a few seconds, it can't be real time. Also, I don't want the ticket update operation to be slowed down (by adding an extra step of updating the Solr index).
It is also mentioned that
"there is no transaction isolation. This means that if more than one
Solr client were to submit modifications and commit them at
overlapping times, it is possible for part of one client's set of
changes to be committed before that client told Solr to commit. This
applies to rollback as well. If this is a problem for your
architecture then consider using one client process responsible for
updating Solr."
Does it mean that, due to the lack of transaction isolation, Solr can mess things up if multiple people update the ticket simultaneously?
Now the question before me is: Can I achieve the two using Solr? If yes, How?
Edit1:
Yeah! I came across a couple of similar questions, but none has a satisfactory answer, so I am posting again. Sorry if you find it a duplicate.
The functionality that you are requesting is known as Near Realtime Search, also referred to as NRT. The work on NRT is still in progress, but there have been excellent incremental improvements to this support in Solr over the last couple of years. Please refer to the following links for more details on the current (versions 1.4 - 3.5) and future (version 4.0) support for NRT.
NRT options
Solr Near Realtime Search for versions 3.5/3.4/3.3/3.2/1.4.1
Near Real Time Search ver 3.x
Near Realtime Search Tuning (ver 1.4 - 3.x)
Solr Near Realtime Search (ver 4.0)
Benchmarking the new Solr 'Near Realtime' improvements (ver 4.0)
Solr with Ranking Algorithm (ver 1.4 - 4.0)

What are some strategies for updating volatile data in Solr?

What are some strategies for updating volatile data in Solr? Imagine if you needed to model YouTube video data in a Solr index: how would you keep the "views" data fresh without swamping Solr in updates?
I would imagine that storing the "views" data in a different data store (something like MongoDB or Redis) that is better at handling rapid updates would be the best idea.
But what is the best way to update the index periodically with that data? Would a delta-import make sense in this context? What does a delta-import do to Solr in terms of performance for running queries?
First you need to define "fresh".
Is "fresh" 1ms? If so, by the time the value (the rendered html) gets to the browser, it's not fresh anymore, due to network latency. Does that really matter? For the vast majority of cases, no, true real-time results are not needed.
A more common limit is 1s. In that case, Solr can deal with that with RankingAlgorithm (a plugin) or soft commits (currently available in Solr 4.0 trunk only).
"Delta-import" is a term from DataImportHandler that doesn't have much intrinsic meaning. From the point of view of a Solr server, there's only document additions, it doesn't matter where they come from or if a set of documents represent the "whole" dataset or not.
If you want to have an item indexed within 1s of its creation/modification, then do just that: add it to Solr just after it's created/modified (for example with a hook in your DAL). This should be done asynchronously, using RA or soft commits.
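As an illustration of that "index right after modification" idea, here is a small SolrJ sketch using commitWithin, which newer Solr releases treat as a soft commit by default; the class, URL, and field names are hypothetical, and in practice the call would be made from a background job rather than inline:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class VideoIndexHook {
    private final HttpSolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/videos").build();

    // Called from the data-access layer right after a video row is created or modified.
    public void onVideoSaved(String id, String title) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("title", title);
        // Ask Solr to make the change visible within ~1 second,
        // without issuing an explicit (expensive) commit per update.
        solr.add(doc, 1000);
    }
}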
You might be interested in so-called "near-realtime search", or NRT, now available on Solr's trunk, which is designed to deal with exactly this problem. See http://wiki.apache.org/solr/NearRealtimeSearch for more info and links.
How about using the external file field?
This helps you to maintain data outside of your index in a separate file, which you can refresh periodically without any changes to the index.
For fast-changing data such as downloads, views, or rank, this can be a good option.
More info: http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
This has some limitations, so you would need to check depending upon your needs.
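A sketch of what that could look like for view counts, assuming documents are keyed on an id field; the field and type names are illustrative:

<!-- schema.xml: the values for this field live outside the Lucene index -->
<fieldType name="fileViews" class="solr.ExternalFileField"
           keyField="id" defVal="0" stored="false" indexed="false" valType="pfloat"/>
<field name="views" type="fileViews"/>

The values themselves then go in a plain text file (named external_views, placed where Solr expects external field files, typically the index data directory), one key=value pair per line:

video-001=42
video-002=117

You can regenerate that file as often as you like without touching the index; note that in these Solr versions an external file field can only be used in function queries (e.g., for boosting or sorting), not searched or returned directly, which is one of the limitations mentioned above.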
