Updating the schema for an IBM Watson Retrieve and Rank config - solr

Are there ways to update the schema of the Solr config in IBM Watson's Retrieve and Rank service other than deleting and then re-uploading the config?
I used the following example to create a new cluster, config and collection.
https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml
I started from the blank example config and updated the schema.
I now need to add and modify some schema elements. Is there a way to do this without deleting and re-uploading the config? How can this be done with minimum downtime?

You can do this, but you have to configure Solr to use a managed schema: https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig and then use the Schema API: https://cwiki.apache.org/confluence/display/solr/Schema+API.
Do note, however, the big caveat on the schema API page:
Re-index after schema modifications!
If you modify your schema, you will likely need to re-index all documents. If you do not, you may lose access to documents, or not be able to interpret them properly, e.g. after replacing a field type.
Modifying your schema will never modify any documents that are already indexed. Again, you must re-index documents in order to apply schema changes to them.
So whether or not you need to re-index depends on the specific schema changes you make. If you're adding a new field, there's no problem; if you're modifying an existing field, the change only affects data you index afterwards, which may mean you should re-index, depending on the change.
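As a minimal sketch, assuming the config has been switched to a managed schema and that you can reach the collection's /schema endpoint (on Bluemix the Retrieve and Rank gateway prefixes the URL and requires your service credentials; the URL, collection name, and field below are hypothetical), adding a field via the Schema API could look like this in Java:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class AddSchemaField {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; substitute your R&R cluster's base URL.
            URL url = new URL("https://localhost:8983/solr/example_collection/schema");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);

            // "add-field" only works when the schema is managed (mutable).
            String body = "{ \"add-field\": { \"name\": \"category\","
                        + " \"type\": \"string\", \"stored\": true, \"indexed\": true } }";
            try (OutputStream os = conn.getOutputStream()) {
                os.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Schema API response code: " + conn.getResponseCode());
        }
    }

A change like adding a field takes effect immediately and needs no config re-upload, which keeps downtime to a minimum; only the re-indexing caveat quoted above applies.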

Related

How can I download all documents from Retrieve and Rank (Solr)?

We have a Cloudant database on Bluemix that contains a large number of documents that are answer units built by the Document Conversion service. These answer units are used to populate a Solr Retrieve and Rank collection for our application. The Cloudant database serves as our system of record for the answer units.
For reasons that are unimportant, our Cloudant database is no longer valid. What we need is a way to download everything from the Solr collection and re-create the Cloudant database. Can anyone tell me a way to do that?
I'm not aware of any automated way to do this.
You'll need to fetch all of your documents from Solr (and, assuming you have a lot of them, do this in a paginated way; there are examples of how to do this in the Solr docs) and add them to Cloudant.
Note that you'll only be able to do this for the fields that you have set to be stored in your schema. If there are important fields that you need in Cloudant that you haven't got stored in Solr, then you might be stuck. :(
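As a rough sketch of the fetch side, assuming SolrJ 6+ on the classpath and a reachable collection URL (the URL, collection name, and uniqueKey field are hypothetical; on Bluemix the R&R gateway adds its own prefix and basic auth), deep paging with a cursor could look like this:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class ExportAllDocs {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder(
                    "https://localhost:8983/solr/example_collection").build();

            SolrQuery query = new SolrQuery("*:*");
            query.setRows(500);
            // Cursor-based deep paging requires a sort on the uniqueKey field.
            query.setSort(SolrQuery.SortClause.asc("id"));

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(query);
                for (SolrDocument doc : rsp.getResults()) {
                    // Only stored fields come back. Write each document to
                    // Cloudant here (e.g. via its HTTP _bulk_docs API).
                    System.out.println(doc.getFieldValueMap());
                }
                String next = rsp.getNextCursorMark();
                if (next.equals(cursor)) break; // unchanged cursor = no more pages
                cursor = next;
            }
            solr.close();
        }
    }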
You can replicate one Cloudant database to another, which will create an exact replica.
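Replication can be triggered with a single POST to Cloudant's _replicate endpoint. A minimal sketch, with hypothetical account and database names (credentials would go in the URLs or an Authorization header in a real setup):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ReplicateDb {
        public static void main(String[] args) throws Exception {
            // Hypothetical account; add your Cloudant credentials in practice.
            URL url = new URL("https://ACCOUNT.cloudant.com/_replicate");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);

            // create_target makes the target database if it does not exist yet.
            String body = "{ \"source\": \"answer_units\","
                        + " \"target\": \"answer_units_copy\","
                        + " \"create_target\": true }";
            try (OutputStream os = conn.getOutputStream()) {
                os.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Replication response: " + conn.getResponseCode());
        }
    }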
Another technique is to use a tool such as couchbackup which takes a copy of your database's documents (ignoring any deletions) and allows you to save the data in a text file. You can then use the couchrestore tool to upload the data file to a new database.
See this blog for more details.

How to copy a Watson Retrieve and Rank Solr collection on Bluemix

We have a large Solr collection on Watson's Retrieve and Rank service that I need to copy to another collection. Is there any way to do this in Retrieve and Rank? I know Solr has backup and restore capability, but it uses the file system and I don't think I have access to that in Bluemix.
I'm not aware of any way to do this beyond the brute-force approach of fetching every doc in the index and adding the contents to a different collection. (And even this is limited to the fields you have stored in the first collection.)
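A minimal sketch of that brute-force copy, assuming SolrJ 6+ and hypothetical URLs, collection names, and uniqueKey field (on Bluemix both clients would point at the R&R gateway with your service credentials):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CopyCollection {
        public static void main(String[] args) throws Exception {
            SolrClient source = new HttpSolrClient.Builder(
                    "https://localhost:8983/solr/source_collection").build();
            SolrClient target = new HttpSolrClient.Builder(
                    "https://localhost:8983/solr/target_collection").build();

            SolrQuery query = new SolrQuery("*:*");
            query.setRows(500);
            query.setSort(SolrQuery.SortClause.asc("id")); // required for cursors

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = source.query(query);
                for (SolrDocument doc : rsp.getResults()) {
                    SolrInputDocument copy = new SolrInputDocument();
                    for (String name : doc.getFieldNames()) {
                        // _version_ is internal; let the target assign its own.
                        if (!"_version_".equals(name)) {
                            copy.addField(name, doc.getFieldValue(name));
                        }
                    }
                    target.add(copy);
                }
                String next = rsp.getNextCursorMark();
                if (next.equals(cursor)) break;
                cursor = next;
            }
            target.commit();
            source.close();
            target.close();
        }
    }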

How to replace a group of documents without "downtime" in Solr?

I have a standalone Solr server (not SolrCloud) holding documents from a few different sources.
Routinely I need to update the documents for a source. Typically I do this by deleting all documents from that source/group and then indexing the new documents for that source, but this creates a time gap in which I have no documents for that source, and that's not ideal.
Some of these documents will probably remain from one update to the next, some change and could be updated, but some may disappear and need to be deleted.
What's the best way to do this?
Is there a way to delete all documents from a source without committing, index that source again in the same transaction, and only then commit? (That would avoid the gap with no information for that source.)
Is core swapping a solution? (Or am I overcomplicating this?)
It sounds like you need a live index that keeps serving queries while you update it, with no downtime. In effect, you are partially re-indexing your data.
You can look into maintaining two indices and interacting with them through aliases.
Check this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multiple-indices.html
Although it's on the Elasticsearch website, the concepts carry over to Solr.
Here is another link on how to create and use aliases:
http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
Collection aliases are also useful for re-indexing, especially when dealing with static indices. You can re-index in a new collection while serving from the existing collection. Once the re-index is complete, you simply swap in the new collection and then remove the first collection using your read side aliases.
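For the single-core setup in the question, the delete-then-re-add idea can also work, because uncommitted deletes and adds only become visible together when you commit. A minimal sketch with SolrJ (URL, core, field, and ids are hypothetical; note that an autoCommit or autoSoftCommit that opens a new searcher mid-way would briefly expose the gap, so disable it for this window):

    import java.util.Arrays;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class RefreshSource {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/example_core").build();

            // Delete every document from the source WITHOUT committing;
            // searchers keep serving the old documents until the commit below.
            solr.deleteByQuery("source:feed_a"); // hypothetical field/value

            // Re-index the fresh documents for that source, still uncommitted.
            for (String id : Arrays.asList("a-1", "a-2")) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", id);
                doc.addField("source", "feed_a");
                solr.add(doc);
            }

            // One commit makes the deletes and the new docs visible together.
            solr.commit();
            solr.close();
        }
    }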

Creating indexes on existing entity properties

When I started my project, I thought there was no need to create indexes on certain fields of my entities. But to generate certain daily reports and statistics, we now need indexes on some fields of existing entities.
As explained in the post Retroactive indexing in GAE Datastore, the only way is to first change these properties from unindexed to indexed, and then retrieve and write all the entities again.
My question: if I take a backup from Datastore Admin and restore it after changing the properties to indexed, will my project have all the required properties indexed, or do I need to retrieve and rewrite the entities through a program?
P.S.: My project is a Java project on GAE.
Edit: The workaround I mentioned earlier does not work. The only way to change the field is to re-upload the entities. Sorry.
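For reference, the retrieve-and-rewrite pass is straightforward with the low-level Datastore API. A minimal sketch, with a hypothetical kind and property name (in practice you would run this from a task queue or a mapper with cursors rather than in a single request):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.PreparedQuery;
    import com.google.appengine.api.datastore.Query;

    public class ReindexEntities {
        public static void main(String[] args) {
            DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
            PreparedQuery pq = datastore.prepare(new Query("Report")); // hypothetical kind

            for (Entity entity : pq.asIterable()) {
                // Re-setting the property with setProperty (rather than
                // setUnindexedProperty) marks it as indexed; the index entry
                // is only written when the entity is put again.
                entity.setProperty("createdDate", entity.getProperty("createdDate"));
                datastore.put(entity);
            }
        }
    }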

Solr/SolrNet: How can I update a document given a document unique ID?

I need to update a few fields of each document in the Solr index, separately from the main indexing process. According to the documentation, "Create" and "Update" are mapped onto the "Add()" function: http://code.google.com/p/solrnet/wiki/CRUD
So if I add a document which already exist, will it replace the entire document or just the fields that I have specified?
If it'll replace the entire document, then the only way I can think of to update is to search for the document by unique id, update the document object, and then "Add" it again. This doesn't sound feasible because of the frequency of update ops required. Is there a better way to update?
Thanks!
Unfortunately, Solr does not currently support updating individual fields for a given document in the index. The latter scenario you describe, retrieving the entire document contents (either from Solr or the original source) and then resending the document (adding via SolrNet), is the only way to update documents in Solr.
Please see the previous question: Update specific field on Solr index for more details about Solr not supporting individual field updates and an open JIRA issue for adding this support to Solr.
If you need to frequently update a lot of documents in Solr, you might need to rethink your solution. In typical setups that use Solr and require frequent document updates, the documents reside in some SQL or NoSQL database and are modified there; you then use DIH (the DataImportHandler) or something similar to bulk-update the Solr index from the database, possibly just dropping the index and re-indexing all content. Solr can index documents very quickly, so that is typically not a problem.
Partial updates of documents are now supported in newer versions of Solr; 4.10, for example, handles them well. Please look at the following page for more information:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
The only detail is that you need to declare your fields as stored=true to allow for partial updates.
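For illustration (in Java with SolrJ, since I can't confirm the SolrNet API for this; the URL, id, and field names are hypothetical), a partial "atomic" update wraps the new value in a "set" operation keyed by the document's unique id:

    import java.util.Collections;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class AtomicUpdate {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/example_collection").build();

            // Identify the document by its unique key and send only the field
            // to change, wrapped in a "set" operation; Solr rebuilds the rest
            // of the document from its stored fields.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-42"); // hypothetical unique id
            doc.addField("price", Collections.singletonMap("set", 19.99));

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }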
I also show how to do it in this training:
http://www.pluralsight.com/courses/enterprise-search-using-apache-solr
Specifically, in the module "Content: Schemas, Documents and Indexing".
