Solr denormalization and update of referenced data

Consider the following situation: we have a database that stores writers and books in two separate tables. Each book stores a reference to the writer who wrote it.
For Solr I have to denormalize this structure into one big document per book, where each book contains the details of its associated writer. This index is then used for querying books.
Now a user of the system decides to update a writer record. Because many books can be associated with that writer, I have to update every Solr document that embeds data from this writer record. This is very painful because, as far as I know, I have to delete and re-add every affected document.
Is there a better way of doing this? I need near real-time updates of the index whenever the referenced data is modified.
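For illustration, a denormalized book document of the kind described above might be indexed like this (the field names, core name, and URL are assumptions, not details from the question):

```python
import requests

# A denormalized "book" document: writer details are copied into every book.
# Field names and the core URL are illustrative assumptions.
book_doc = {
    "id": "book-42",
    "title": "A Sample Novel",
    "writer_id": "writer-7",
    "writer_name": "Jane Doe",
    "writer_country": "Ireland",
}

# Index (or re-index) the document; Solr replaces a document with the same id.
resp = requests.post(
    "http://localhost:8983/solr/books/update?commit=true",
    json=[book_doc],
)
resp.raise_for_status()
```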

This would be a perfect use case for nested documents. As far as I know, Lucene supports nested documents but Solr doesn't; I'm not totally sure about the current state of this feature.
This feature is available in elasticsearch, though. You might want to have a look at it; there's an article I just wrote that may be interesting if you want to know what I find so compelling about elasticsearch. Your question reminded me that I didn't mention the nested documents feature in my article, which is really nice too. You can use the nested type in your mapping. If you want to know more, have a look at that article; it happens to contain exactly the books/authors example.
Elasticsearch also helps when updating documents. You don't need to reindex the whole document; you can send only the changes through a script. Because it stores the source of every indexed document, it internally retrieves the document, applies the script, and reindexes it. Internally the document is still rewritten, since Lucene index segments are write-once. With Solr 4, which will be released soon, you can update documents by providing only the changes, but as far as I know this works only if all your fields are stored, since fields that are not stored cannot be retrieved from the index.
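As a rough sketch of what such a partial update looks like against the elasticsearch Update API (index, type, document id, and field names here are assumptions for illustration):

```python
import requests

# Partial update via the elasticsearch Update API: only the changed writer
# fields are sent; elasticsearch merges them into the stored _source and
# reindexes the document internally. Index/type/field names are assumptions.
partial = {"doc": {"writer_name": "Jane A. Doe"}}

resp = requests.post(
    "http://localhost:9200/library/book/42/_update",
    json=partial,
)
resp.raise_for_status()
```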
As for near real-time updates, elasticsearch uses the Lucene near real-time API and automatically refreshes the index reader every second. Solr 3 doesn't use those APIs yet, but Solr 4 does.

For updating nested/denormalized data like this in Solr you can use the DataImportHandler and delta imports. The example at https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example shows how this would work. Obviously you would then need to let Solr access your database.
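Once a data-config.xml with delta queries is in place, the delta import can be triggered on a schedule or right after the writer table changes. A minimal sketch, assuming a core named `books` with the DataImportHandler registered at `/dataimport`:

```python
import requests

# Trigger a delta-import so that only rows changed since the last import
# are re-indexed. Core name and handler path are assumptions.
resp = requests.get(
    "http://localhost:8983/solr/books/dataimport",
    params={"command": "delta-import", "commit": "true"},
)
resp.raise_for_status()
print(resp.text)  # DIH returns a status report
```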

Related

Why do search engines need to reindex periodically but databases don't?

For example, search engines such as Sphinx and Lucene must merge their indexes periodically, but a database index can be updated dynamically. Why must a search engine's index be merged?
I don't know much about Sphinx, but I believe the answer to this question doesn't depend on it.
First, why don't databases need periodic updates? Because the database is usually the primary data store for the application. By this I mean that any data you create, delete, or update exists as a database record: to get rid of data in the application you remove it from the database, and to update data you first read the current version from the database, since that is where it is kept. All this means databases are being updated all the time, so the data there is always up to date.
Why does a search engine's index need periodic reindexing? The index is essentially the search engine's data store: you process your data, put it into the index, and then retrieve it through your search system. That index is a secondary data resource. This doesn't hold for all applications, but most of the time you have a database as the primary resource, kept in sync with your application as explained above, and an index where you don't reflect every change in real time. The data in the index therefore lags slightly behind the database, and the periodic reindexing step is what keeps the two data resources consistent.
As I said, this explanation doesn't hold for all applications, but it should give you the basic idea.
PS: your question mentions a database index, but that is a different topic entirely.

What is the most efficient way to update a Cloudant document numerous times?

The application I am building involves a customer's ID being recorded when they first use the app. As they use the app, additional data needs to be added to this customer's record in a Cloudant DB. Therefore, this solution will require numerous updates to documents for each customer that uses it.
I have looked through the following documentation, but it appears that the recommended way to solve this problem is to first GET the document with a known ID, and then ADD the document again with the new data inserted.
https://docs.cloudant.com/tutorials/crud/index.html
https://docs.cloudant.com/guides/eventsourcing.html
It seems that this may be inefficient, because the code would frequently be grabbing entire documents just to make a minor change and then adding the new document back to the DB. Given the plethora of incremental updates I plan to use in my code, I am worried about efficiency. How would you advise addressing this?
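For reference, the read-modify-write cycle that documentation describes looks roughly like this against the standard Cloudant/CouchDB HTTP API (account URL, database, credentials, and field names are assumptions):

```python
import requests

CLOUDANT = "https://ACCOUNT.cloudant.com"  # assumed account URL
DB = "customers"
AUTH = ("user", "password")  # assumed credentials

def add_event(doc_id, event):
    # 1. Fetch the current document (including its _rev).
    doc = requests.get(f"{CLOUDANT}/{DB}/{doc_id}", auth=AUTH).json()
    # 2. Modify it locally.
    doc.setdefault("events", []).append(event)
    # 3. Write it back; the _rev in the body must match the stored revision,
    #    otherwise Cloudant rejects the update with a 409 conflict.
    resp = requests.put(f"{CLOUDANT}/{DB}/{doc_id}", json=doc, auth=AUTH)
    resp.raise_for_status()

add_event("customer-123", {"action": "opened_app"})
```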

Manipulate Solr index with lucene

I have a Solr core with 100K-1000K documents.
I have a scenario where I need to add or set a field value on most documents.
Doing it through Solr takes too much time.
I was wondering if there is a way to do such a task with the Lucene library by accessing the Solr index directly (with less overhead).
If needed, I can shut down the core, run my code, and reload the core afterwards (hoping it will take less time than doing it with Solr).
It would be great to hear if someone has already done such a thing and what the major pitfalls are.
A similar problem has been discussed multiple times on the Lucene Java mailing list. The underlying issue is that you cannot update a document in place in Lucene (and hence Solr).
Instead, you need to delete the document and insert a new one. This obviously adds the overhead of analyzing, merging index segments, etc. Still, the amount of documents you mention isn't anything major and should not take days (have you tried updating Solr with multiple threads?).
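For what it's worth, a rough sketch of pushing updated documents to Solr from several threads (core URL, batch source, and batch size are assumptions):

```python
import requests
from concurrent.futures import ThreadPoolExecutor

SOLR_UPDATE = "http://localhost:8983/solr/mycore/update"  # assumed core

def send_batch(batch):
    # Each batch is a list of full documents; Solr replaces docs by id.
    requests.post(SOLR_UPDATE, json=batch).raise_for_status()

def batches(docs, size=1000):
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def reindex(docs):
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(send_batch, batches(docs)))
    # Commit once at the end rather than per batch.
    requests.get(SOLR_UPDATE, params={"commit": "true"}).raise_for_status()
```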
You can of course try doing this via Lucene and see if this makes any difference, but you need to be absolutely sure you will be using the same analyzers as Solr does.
I have a scenario where I need to add or set a field value on most documents.
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.

Clustering documents in Solr

First of all I have to mention that I mean document clustering as a data mining technique, not workload clustering or something like that.
To start with, here is what I have:
I receive documents all the time. Let's assume they are news articles (it's a rather similar thing).
Every time I get a new batch of "news" I should add it to the Solr index and get cluster information for each document, then store this information in the DB (so I know each document's cluster).
I can't wait for a cluster-definition service/program to run from time to time; it should define clusters on the fly.
I want to be able to get clusters only for some period of time (for example, I want to search for clusters only among documents that were loaded one month ago).
I will have tens of thousands of new documents every day and an overall base of several million.
A long time ago I used some library (I can't remember its name) that received a document as input and returned a cluster id; if it thought the document belonged to a new cluster it created one, and so on. But it worked slowly (and I really can't remember its name).
I've found a book about Mahout, but I still can't figure out what I should read and whether it covers what I want. And maybe it's impossible to do this with Solr/Mahout without writing my own Solr plugins.
I would appreciate any thoughts or advice on how to build such a system.
Thanks in advance.
I don't think you need any kind of custom Solr plugin. The cluster assignment for new documents can be determined during the normal indexing process of your "news", so you can just add it as a regular field to every Solr document.
When it comes to clustering and classification with Mahout, I'd say the Mahout in Action book is a good resource to start with.
Cheers.
Rather an old post, but nevertheless let me respond: you can use Carrot2 (http://project.carrot2.org/index.html) for Solr result clustering. This always happens on the fly.
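As a quick illustration, Solr's Carrot2-based clustering component can be invoked per query. A minimal sketch, assuming a core whose solrconfig.xml registers the clustering component on a `/clustering` request handler (handler, core, and query are assumptions based on the stock Solr clustering example):

```python
import requests

# Query Solr and ask the ClusteringComponent to cluster the search results
# on the fly. Handler name, core name, and parameters are assumptions.
resp = requests.get(
    "http://localhost:8983/solr/news/clustering",
    params={
        "q": "economy",
        "rows": 100,
        "clustering": "true",
        "clustering.results": "true",
        "wt": "json",
    },
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["labels"], cluster["docs"])
```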

Solr/SolrNet: How can I update a document given a document unique ID?

I need to update a few fields of each document in the Solr index, separately from the main indexing process. According to the documentation, "Create" and "Update" are mapped onto the "Add()" function. http://code.google.com/p/solrnet/wiki/CRUD
So if I add a document which already exists, will it replace the entire document or just the fields that I have specified?
If it replaces the entire document, then the only way I can think of to update is to retrieve the document by its unique id, update the document object, and then "Add" it again. This doesn't sound feasible given the frequency of update operations required. Is there a better way to update?
Thanks!
Unfortunately, Solr does not currently support updating individual fields of a given document in the index. The latter scenario you describe, retrieving the entire document contents (either from Solr or from the original source) and then resending the document (adding via SolrNet), is the only way to update documents in Solr.
Please see the previous question "Update specific field on Solr index" for more details about Solr not supporting individual field updates and the open JIRA issue for adding this support to Solr.
If you need to frequently update a lot of documents in Solr, you might need to rethink your entire solution. In typical solutions that use Solr and require lots of frequent document updates, the documents reside in some SQL or NoSQL database and are modified there. You then use DIH or something similar to bulk-update the Solr index from the database, possibly just dropping the index and re-indexing all content. Solr can index documents very quickly, so that is typically not a problem.
Partial updating of documents is now supported in newer versions of Solr; 4.10, for example, handles it well. See the following page for more information:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
The only caveat is that you need to declare your fields as stored=true to allow partial updates.
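A minimal sketch of such a partial (atomic) update over Solr's JSON update API, assuming a `books` core with stored fields; only the writer fields are sent, and "set" tells Solr to replace just those values:

```python
import requests

# Atomic update: only the named fields change; Solr rebuilds the rest of the
# document from its stored fields. Core URL and field names are assumptions.
atomic = [{
    "id": "book-42",
    "writer_name": {"set": "Jane A. Doe"},
    "writer_country": {"set": "Canada"},
}]

resp = requests.post(
    "http://localhost:8983/solr/books/update?commit=true",
    json=atomic,
)
resp.raise_for_status()
```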
I also show how to do it in this training:
http://www.pluralsight.com/courses/enterprise-search-using-apache-solr
In this specific module: Content: Schemas, Documents and Indexing
