How to replace a group of documents without "downtime" in Solr? - solr

I have a standalone Solr server (not SolrCloud) holding documents from a few different sources.
Routinely I need to update the documents for a source. Typically I do this by deleting all documents from that source/group and indexing the new documents for that source, but this creates a time gap where I have no documents for that source, which is not ideal.
Some of these documents will probably remain from one update to the next, some change and could be updated, but some may disappear and need to be deleted.
What's the best way to do this?
Is there a way to delete all documents from a source without committing, index that source again in the same transaction, and only then commit? (That would not create a time gap with no information for that source.)
Is core swapping a solution? (Or am I overcomplicating?)

It seems you need a live index that keeps serving queries while you update it, without any downtime. In a way you are partially re-indexing your data.
You can look into maintaining two indices and interacting with them through aliases.
Check this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multiple-indices.html
Although it's on the Elasticsearch website, you can easily apply the concepts in Solr.
Here is another link on how to create and use aliases:
http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
Collection aliases are also useful for re-indexing – especially when
dealing with static indices. You can re-index in a new collection
while serving from the existing collection. Once the re-index is
complete, you simply swap in the new collection and then remove the
first collection using your read side aliases.
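Collection aliases are a SolrCloud feature, so on a standalone server the closest equivalents are the two options raised in the question. The following is a minimal sketch under stated assumptions, not a definitive implementation: it only builds the HTTP requests and sends nothing. The base URL, the core names `live` and `staging`, and the field name `source` are all hypothetical. The first helper relies on Solr not exposing uncommitted changes to searchers; since commits are global rather than per client, this only holds if no auto-commit that opens a searcher and no other client's commit fires in between. The second helper uses the CoreAdmin SWAP action.

```python
import json
from urllib.parse import urlencode


def replace_source_requests(base_url, core, source, new_docs):
    """Build (url, json_body) pairs: delete one source, re-add it, commit once.

    Assumes a string field named "source" (hypothetical) and source values
    without embedded double quotes.
    """
    update_url = f"{base_url}/{core}/update"
    return [
        # 1. Delete everything from the source; no commit yet.
        (update_url, json.dumps({"delete": {"query": f'source:"{source}"'}})),
        # 2. Add the fresh documents; still no commit.
        (update_url, json.dumps(new_docs)),
        # 3. A single commit makes the delete and the adds visible together.
        (update_url, json.dumps({"commit": {}})),
    ]


def core_swap_url(base_url, live_core, staging_core):
    """Build the CoreAdmin URL that swaps two cores by name."""
    params = {"action": "SWAP", "core": live_core, "other": staging_core}
    return f"{base_url}/admin/cores?{urlencode(params)}"
```

Each body would be POSTed with Content-Type application/json. With the swap approach you would index into the staging core first and call the swap URL only once indexing has finished.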

Related

How can I tell if a Solr's index has changed, including any modification, addition, or deletion of a document?

We have a use case where we spin up an embedded Solr server (using the SolrJ EmbeddedSolrServer API) from a remote Solr instance, so that we can serve documents extremely fast in a query pipeline.
One thing I am stuck on is determining whether the remote Solr instance has been modified in any way since the last sync. Obviously, a naive way is to compare documents one at a time. However, that would be extremely inefficient and would completely negate the purpose of being fast.
Thanks for any tips or recommendations.
Each version of the Lucene index is assigned a version number. This version number is exposed through the Replication Handler (which you might already be using to replicate the index to your local embedded Solr instance):
http://host:port/solr/core_name/replication?command=indexversion
Returns the version of the latest replicatable index on the specified master or slave.
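A minimal sketch of polling that endpoint, assuming the handler is asked for JSON with `wt=json` and that the response carries a top-level `indexversion` value. The helper names and the sample response are hypothetical, and the HTTP call itself is left out.

```python
import json
from urllib.parse import urlencode


def indexversion_url(base_url, core):
    """Build the Replication Handler URL that reports the index version."""
    query = urlencode({"command": "indexversion", "wt": "json"})
    return f"{base_url}/{core}/replication?{query}"


def index_changed(response_text, last_seen_version):
    """Compare the reported index version with the one stored at the last sync."""
    payload = json.loads(response_text)
    return int(payload["indexversion"]) != last_seen_version


# Hypothetical response body, used only to show the comparison.
sample = '{"responseHeader": {"status": 0}, "indexversion": 1493745463529, "generation": 5}'
needs_sync = index_changed(sample, last_seen_version=1493745400000)
```

The embedded instance would store the version it last synced against and trigger a new sync only when the comparison reports a change.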
If you want to do it more manually, you can use the _version_ field that is automagically added to all documents in recent versions of Solr, and fetch any documents whose _version_ value is larger than the current largest version in your index. This assumes you use the default _version_ numbering (which you kind of have to, since it's also used internally for SolrCloud).
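The _version_ approach could look roughly like this. It is a sketch that assumes the unique key is `id` and that `_version_` can be used in range queries and sorting in your schema; the function name and the page size are hypothetical. Note that a query like this finds added and changed documents, but not deleted ones.

```python
from urllib.parse import urlencode


def newer_versions_params(max_local_version, rows=500):
    """Query parameters selecting documents newer than the local maximum."""
    return {
        # Exclusive lower bound, open upper bound.
        "q": f"_version_:{{{max_local_version} TO *]",
        "sort": "_version_ asc",
        "fl": "id,_version_",
        "rows": rows,
        "wt": "json",
    }


params = newer_versions_params(1573828395063214080)
url = "http://host:8983/solr/core_name/select?" + urlencode(params)
```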
If you want to track individual documents, you can add a date field that is set for every document on the Solr side.
That is, add a new date field to the schema, named for example UpdateDateTime, and update it every time a document is modified or newly added.
I am not sure how you handle deleted documents on the Solr side. If you are not tracking deletions yet, you can add another boolean field, isDeleted.

Manipulate Solr index with lucene

I have a Solr core with 100K-1000K documents.
I have a scenario where I need to add or set a field value on most documents.
Doing it through Solr takes too much time.
I was wondering if there is a way to do this with the Lucene library, accessing the Solr index directly (with less overhead).
If needed, I can shut down the core, run my code, and reload the core afterwards (hoping it will take less time than doing it with Solr).
It would be great to hear whether someone has already done such a thing and what the major pitfalls are.
A similar problem has been discussed multiple times on the Lucene Java mailing list. The underlying problem is that you cannot update a document in place in Lucene (and hence Solr).
Instead, you need to delete the document and insert a new one. This obviously adds the overhead of analyzing, merging index segments, etc. Yet the number of documents you mention isn't large, and it should not take days (have you tried updating Solr with multiple threads?).
You can of course try doing this via Lucene and see if it makes any difference, but you need to be absolutely sure you use the same analyzers as Solr does.
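The multi-threaded suggestion could be sketched as follows. This is a minimal example, assuming a hypothetical `send_batch` callable that posts one batch of documents to Solr and returns how many it sent; the batch size and worker count are placeholders to tune.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def index_in_parallel(docs, send_batch, batch_size=1000, workers=4):
    """Send batches concurrently and return the total number of documents sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sent_counts = list(pool.map(send_batch, chunked(docs, batch_size)))
    return sum(sent_counts)
```

A single commit after all batches have been sent usually costs less than committing per batch.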
I have a scenario where I need to add or set a field value on most documents.
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.
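With ExternalFileField the values live outside the index, in a file named external_<fieldname> in the core's data directory, with one key=value line per document and float values. A minimal sketch of generating that file, assuming a hypothetical field called `popularity` and a temporary directory standing in for the real data directory:

```python
import os
import tempfile


def write_external_file(data_dir, field_name, values):
    """Write {doc_id: number} as key=value lines for an ExternalFileField."""
    path = os.path.join(data_dir, f"external_{field_name}")
    # The temp name deliberately does not start with "external_<field>",
    # so Solr cannot mistake a half-written file for the real one.
    tmp_path = os.path.join(data_dir, f"tmp_{field_name}.partial")
    with open(tmp_path, "w", encoding="utf-8") as handle:
        for doc_id, value in sorted(values.items()):
            handle.write(f"{doc_id}={float(value)}\n")
    os.replace(tmp_path, path)  # swap the finished file into place
    return path


with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = write_external_file(demo_dir, "popularity", {"doc2": 3, "doc1": 1.5})
    with open(demo_path, encoding="utf-8") as handle:
        demo_content = handle.read()
```

One of the limitations is that such a field is meant for function queries and sorting rather than for regular searching.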

Solr denormalization and update of referenced data

Consider the following situation. We have a database which stores writers and books in two separate tables. Each book stores a reference to the writer who wrote it.
For Solr I have to denormalize this structure into one big document where every book contains the details of its writer. This index is then used for querying books.
Now a user of the system decides to update a writer record. Because many books can be associated with it, I have to update every document in Solr that embeds data from this writer record. This is very painful because, as far as I know, I have to delete and re-add every affected document.
Is there a better way of doing this? I need near-real-time updates of the index whenever referenced data is modified.
This would be a perfect use case for nested documents. As far as I know, Lucene supports nested documents but Solr doesn't; I'm not totally sure about the current state of this feature.
This feature is available in Elasticsearch though. You might want to have a look at it; there's an article I just wrote that can be interesting if you want to know what's so cool about Elasticsearch in my opinion. Your question just reminded me that I didn't mention the nested documents feature in my article, which is really cool too. You can use the nested type in your mapping. If you want to know more you can have a look at this article. By the way, it contains exactly the books/authors example.
Elasticsearch also helps you when updating documents. You don't need to reindex the whole document; you send only the changes through a script. Because it stores the source document that was indexed, it retrieves it internally, updates it by running the script, and reindexes it. That's how Lucene works internally, since its index segments are write-once. With Solr 4, which will soon be released, you can update documents by providing only the changes, but as far as I know this works only if all your fields are stored. Fields that are not stored cannot be retrieved from the index.
If we are talking about near-real-time updates, Elasticsearch uses the Lucene Near Real Time API and automatically refreshes the index reader every second. Solr 3 doesn't use those APIs yet, but Solr 4 does.
For updating nested types in Solr you can use the DataImportHandler and delta imports. The example at https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example shows how this would work. Obviously you would then need to let Solr access your database.
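Once the delta queries are configured as in the linked example, triggering the import is a single HTTP call. A minimal sketch, assuming the handler is registered under the conventional `/dataimport` path and a hypothetical core named `books`:

```python
from urllib.parse import urlencode


def delta_import_url(base_url, core, handler="dataimport", commit=True):
    """Build the URL that asks the DataImportHandler for a delta import."""
    params = {"command": "delta-import", "commit": str(commit).lower()}
    return f"{base_url}/{core}/{handler}?{urlencode(params)}"


url = delta_import_url("http://localhost:8983/solr", "books")
```

A scheduler or the application itself would call this URL after a writer record changes. Whether the affected books are picked up depends on the delta queries in the handler configuration.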

Regularly updated data and the Search API

I have an application which requires very flexible searching functionality. As part of this, users will need the ability to do full-text searching of a number of text fields, but also to filter by a number of numeric fields whose data is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text-data and I am aware that I can also add numeric values to these documents for indexing. However, with the dynamic nature of these numeric fields I would be constantly updating (deleting and recreating) the documents for the search API index. Even if I allowed the search API to use the older data for a period it would still need to be updated a few times a day. To me, this doesn't seem like an efficient way to store this data for searching, particularly given the number of search queries will be considerably less than the number of updates to the data.
Is there an effective way I can deal with this dynamic data that is more efficient than having to be constantly revising the search documents?
My only idea is to implement a two-step process where the results of a full-text search are then either used in a query against the NDB datastore or manually filtered using Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.
It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well, either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.
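The design above can be sketched in plain Python. This is only an illustration: a dict stands in for memcache, and `load_from_datastore`, the `price` field, and the predicate are hypothetical placeholders for the real datastore lookup and filter.

```python
def filter_hits_by_cached_numbers(hit_doc_ids, numeric_cache, load_from_datastore, predicate):
    """Keep the doc ids whose numeric data passes the predicate.

    numeric_cache maps doc id -> dict of numeric fields; on a miss the
    values are loaded from the datastore and cached for the next query.
    """
    kept = []
    for doc_id in hit_doc_ids:
        numbers = numeric_cache.get(doc_id)
        if numbers is None:
            numbers = load_from_datastore(doc_id)
            numeric_cache[doc_id] = numbers
        if predicate(numbers):
            kept.append(doc_id)
    return kept


def on_entity_update(numeric_cache, doc_id, new_numbers):
    """Keep the cache in step when the datastore entity changes."""
    numeric_cache[doc_id] = new_numbers
```

The full-text query returns doc ids only, so the search documents never need to be rewritten when a numeric value changes.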

Solr/SolrNet: How can I update a document given a document unique ID?

I need to update a few fields of each document in the Solr index separately from the main indexing process. According to the documentation, "Create" and "Update" are both mapped onto the "Add()" function. http://code.google.com/p/solrnet/wiki/CRUD
So if I add a document which already exists, will it replace the entire document or just the fields that I have specified?
If it replaces the entire document, then the only way I can think of to update is to search for the document by unique ID, update the document object, and then "Add" it again. This doesn't sound feasible because of the frequency of update operations required. Is there a better way to update?
Thanks!
Unfortunately, Solr does not currently support updating individual fields for a given document in the index. The latter scenario you describe, retrieving the entire document contents (either from Solr or the original source) and then resending the document (adding via SolrNet), is the only way to update documents in Solr.
Please see the previous question: Update specific field on Solr index for more details about Solr not supporting individual field updates and an open JIRA issue for adding this support to Solr.
If you need to frequently update a lot of documents in SOLR, you might need to rethink your entire solution. In typical solutions that use SOLR and require lots of frequent updates to documents, the way it is usually done is that the documents reside in some SQL or NoSQL database, and they are modified there. Then you use DIH or something similar to bulk update the SOLR index from the database, possibly just dropping the index and re-indexing all content. SOLR can index documents very quickly so that is typically not a problem.
Partial updating of documents is now supported in newer versions of Solr; for example, 4.10 handles it well. Please look at the following page for more information:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
The only detail is that you need to declare your fields as stored=true to allow for partial updates.
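A partial (atomic) update sends only the unique key plus a modifier such as "set" for each field to change. A minimal sketch of building such a request body, assuming the unique key field is named `id`; the document id and field names are hypothetical.

```python
import json


def atomic_set(doc_id, **fields):
    """Build an atomic-update document that overwrites only the given fields."""
    doc = {"id": doc_id}
    for name, value in fields.items():
        doc[name] = {"set": value}
    return doc


# Body to POST to the core's /update handler as application/json.
body = json.dumps([atomic_set("book-42", price=12.5, in_stock=True)])
```

Fields not mentioned in the update keep their previous values, which is why Solr must be able to read them back from the index.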
I also show how to do it in this training:
http://www.pluralsight.com/courses/enterprise-search-using-apache-solr
In this specific module: Content: Schemas, Documents and Indexing
