We have a large Solr collection on Watson's Retrieve and Rank service that I need to copy to another collection. Is there any way to do this in Retrieve and Rank? I know Solr has backup and restore capability, but it uses the file system and I don't think I have access to that in Bluemix.
I'm not aware of any way to do this, beyond the brute force approach of just fetching every doc in the index and adding the contents to a different collection. (And even this would be limited to only letting you fetch the fields that you have stored in the first collection).
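For what it's worth, the brute force approach could be scripted along these lines (an untested sketch; the URLs, credentials, and page size are placeholders for whatever your Retrieve and Rank cluster exposes, and only stored fields will come back):

```python
import requests

SOURCE_URL = "https://example.com/solr/source_collection"  # placeholder endpoint
TARGET_URL = "https://example.com/solr/target_collection"  # placeholder endpoint
AUTH = ("username", "password")                            # placeholder credentials

cursor = "*"
while True:
    # Page through the source collection with a cursor (requires sorting on the uniqueKey).
    page = requests.get(SOURCE_URL + "/select", auth=AUTH, params={
        "q": "*:*", "sort": "id asc", "rows": 500,
        "cursorMark": cursor, "wt": "json",
    }).json()
    docs = page["response"]["docs"]
    for doc in docs:
        doc.pop("_version_", None)   # let the target collection assign its own versions
    if docs:
        requests.post(TARGET_URL + "/update", auth=AUTH, json=docs)
    if page["nextCursorMark"] == cursor:   # cursor stops advancing once everything is fetched
        break
    cursor = page["nextCursorMark"]

# Make the copied documents visible in the target collection.
requests.post(TARGET_URL + "/update", auth=AUTH, json={"commit": {}})
```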
There is a use case for us where we spin up an embedded Solr server (using the SolrJ EmbeddedSolrServer API) from a remote Solr instance. This is so that we can serve documents extremely fast in a query pipeline.
One of the things I am stuck on is determining whether the remote Solr instance has been modified in any way since the last sync was done. Obviously, a naive way to do this is to compare documents one at a time. However, that would be extremely inefficient and completely negate the entire purpose of being fast.
Thanks for any tips or recommendations.
Each version of the Lucene index is assigned a version number. This version number is exposed through the Replication Handler (which you might already be using to replicate the index to your local embedded Solr instance):
http://host:port/solr/core_name/replication?command=indexversion
Returns the version of the latest replicatable index on the specified master or slave.
If you want to do it more manually, you can use the _version_ field that is automagically added to all documents in recent versions of Solr, and fetch any _version_ values that are larger than the current largest version in your index. This assumes you use the default _version_ numbering (which you kind of have to, since it's also used internally for SolrCloud).
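To make that concrete, both checks could look roughly like this (a sketch with a placeholder core URL and a placeholder last-known version; adjust paging to taste):

```python
import requests

BASE_URL = "http://host:8983/solr/core_name"   # placeholder core URL
LAST_VERSION = 1234567890123456789             # placeholder: highest _version_ already held locally

# 1) Cheap check via the Replication Handler: has the index changed at all?
info = requests.get(BASE_URL + "/replication",
                    params={"command": "indexversion", "wt": "json"}).json()
remote_index_version = info["indexversion"]

# 2) If it has, fetch only documents whose _version_ is newer than what we hold locally.
changed = requests.get(BASE_URL + "/select", params={
    "q": "*:*",
    "fq": "_version_:{%d TO *]" % LAST_VERSION,   # exclusive lower bound on _version_
    "sort": "_version_ asc",
    "rows": 500,
    "wt": "json",
}).json()["response"]["docs"]
```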
If you want to track individual documents, you can have a date field that is set for every document on the Solr side.
I mean you can add a new date field to the schema, named something like UpdateDateTime, which is updated every time a document is modified or newly added.
I am not sure how you are handling deletion of documents on the Solr side. If you are not tracking deletions, you can have another boolean field, say isDeleted.
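If it helps, the two fields could be added through the Schema API along these lines (a sketch assuming a managed schema; the field names follow the suggestion above, and the pdate type is a Solr 7+ assumption, with tdate on older versions):

```python
import requests

SCHEMA_URL = "http://host:8983/solr/core_name/schema"   # placeholder core URL

# Timestamp that is (re)set whenever a document is added or re-indexed.
requests.post(SCHEMA_URL, json={"add-field": {
    "name": "UpdateDateTime", "type": "pdate",
    "indexed": True, "stored": True, "default": "NOW"}})

# Soft-delete marker, so deletions can be tracked instead of documents silently vanishing.
requests.post(SCHEMA_URL, json={"add-field": {
    "name": "isDeleted", "type": "boolean",
    "indexed": True, "stored": True, "default": "false"}})
```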
We have a Cloudant database on Bluemix that contains a large number of documents that are answer units built by the Document Conversion service. These answer units are used to populate a Solr Retrieve and Rank collection for our application. The Cloudant database serves as our system of record for the answer units.
For reasons that are unimportant, our Cloudant database is no longer valid. What we need is a way to download everything from the Solr collection and re-create the Cloudant database. Can anyone tell me a way to do that?
I'm not aware of any automated way to do this.
You'll need to fetch all your documents from Solr (and assuming you have a lot of them, do this in a paginated way - there are some examples of how to do this in the Solr doc) and add them into Cloudant.
Note that you'll only be able to do this for the fields that you have set to be stored in your schema. If there are important fields that you need in Cloudant that you haven't got stored in Solr, then you might be stuck. :(
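A hand-rolled export could look something like this (untested sketch; the URLs, credentials, and page size are placeholders, and again only stored fields come back from Solr):

```python
import requests

SOLR_URL = "https://example.com/solr/answer_units"            # placeholder Solr collection URL
CLOUDANT_DB = "https://account.cloudant.com/answer_units"     # placeholder Cloudant database URL
SOLR_AUTH = ("solr_user", "solr_pass")                        # placeholder credentials
CLOUDANT_AUTH = ("cloudant_user", "cloudant_pass")            # placeholder credentials

cursor = "*"
while True:
    # Fetch one page of stored fields from Solr.
    page = requests.get(SOLR_URL + "/select", auth=SOLR_AUTH, params={
        "q": "*:*", "sort": "id asc", "rows": 200,
        "cursorMark": cursor, "wt": "json"}).json()
    docs = page["response"]["docs"]
    if docs:
        for doc in docs:
            doc.pop("_version_", None)   # drop Solr-internal bookkeeping before storing
        # Write the page into Cloudant in bulk.
        requests.post(CLOUDANT_DB + "/_bulk_docs", auth=CLOUDANT_AUTH,
                      json={"docs": docs})
    if page["nextCursorMark"] == cursor:
        break
    cursor = page["nextCursorMark"]
```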
You can replicate one Cloudant database to another, which will create an exact replica.
Another technique is to use a tool such as couchbackup which takes a copy of your database's documents (ignoring any deletions) and allows you to save the data in a text file. You can then use the couchrestore tool to upload the data file to a new database.
See this blog for more details.
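As a rough illustration of the replication route (the account URL, database names, and credentials are placeholders), a single POST to the account's _replicate endpoint kicks it off:

```python
import requests

ACCOUNT = "https://account.cloudant.com"      # placeholder account URL
AUTH = ("cloudant_user", "cloudant_pass")     # placeholder credentials

resp = requests.post(ACCOUNT + "/_replicate", auth=AUTH, json={
    "source": ACCOUNT + "/answer_units",       # database to copy from
    "target": ACCOUNT + "/answer_units_copy",  # database to copy to
    "create_target": True,                     # create the target if it doesn't exist
})
print(resp.json())   # {"ok": true, ...} on success
```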
I have a standalone Solr server (not SolrCloud) holding documents from a few different sources.
Routinely I need to update the documents for a source. Typically I do this by deleting all documents from that source/group and then indexing the new documents for that source, but this creates a time gap where I have no documents for that source, and that's not ideal.
Some of these documents will probably remain from one update to the next, some change and could be updated, but some may disappear and need to be deleted.
What's the best way to do this?
Is there a way to delete all documents from a source without committing, and in the same transaction index that source again and only then commit? (That would avoid a time gap with no information for that source.)
Is core swapping a solution? (Or am I overcomplicating this?)
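For context, the pattern being asked about would look roughly like this against a standalone core (placeholder URL, field, and documents; nothing becomes visible to searchers until the final commit, though an autoSoftCommit firing in between would reopen the gap):

```python
import requests

CORE_URL = "http://localhost:8983/solr/mycore"   # placeholder core URL
new_docs = [{"id": "a-1", "source": "feedA", "title": "..."}]  # placeholder replacement documents

# 1) Queue the delete of everything from this source (no commit yet).
requests.post(CORE_URL + "/update", json={"delete": {"query": "source:feedA"}})

# 2) Queue the fresh documents for the same source (still no commit).
requests.post(CORE_URL + "/update", json=new_docs)

# 3) Commit once: the delete and the re-add become visible together.
requests.post(CORE_URL + "/update", json={"commit": {}})
```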
It seems like you need a live index that keeps serving queries while you update it, without any downtime. In a way, you are partially re-indexing your data.
You can look into maintaining two indices, and interacting with them using ALIASES.
Check this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multiple-indices.html
Although it's on the Elasticsearch website, you can easily apply the concepts in Solr.
Here is another link on how to create/use ALIASES
http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
Collection aliases are also useful for re-indexing – especially when dealing with static indices. You can re-index in a new collection while serving from the existing collection. Once the re-index is complete, you simply swap in the new collection and then remove the first collection using your read side aliases.
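A sketch of that swap with the SolrCloud Collections API might look like this (alias and collection names are placeholders, and this, like the links above, assumes SolrCloud-style collections rather than a single standalone core):

```python
import requests

SOLR = "http://localhost:8983/solr"   # placeholder base URL

# Applications always query the "products" alias. After re-indexing into
# products_v2 (while products_v1 keeps serving), repoint the alias:
requests.get(SOLR + "/admin/collections", params={
    "action": "CREATEALIAS",
    "name": "products",             # the alias queries go to
    "collections": "products_v2",   # now backed by the freshly built collection
    "wt": "json",
})

# Once nothing references the old collection any more, it can be removed.
requests.get(SOLR + "/admin/collections", params={
    "action": "DELETE", "name": "products_v1", "wt": "json"})
```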
We are leveraging Solr to support full document search, whereby users can search on the content within the documents. Further, metadata is associated with each of the documents so that searches can be run against the metadata as well.
Up to this point everything is fine. However, when only the metadata needs to be updated (i.e. the document itself has not changed), I am not able to figure out a suitable mechanism to update only the metadata without re-indexing the document. Since I could not find an appropriate solution, I am currently re-indexing the document as well as updating the associated metadata. I know this is an inelegant solution. I would appreciate help on ways to update the metadata without having to re-index the binary document.
If it's metadata beyond what is extracted from the document itself, you can look at partial updates to the document with Solr.
With Solr 4.0+ you can do a partial update of those documents, sending just the fields that have changed while keeping the rest of the document the same. The id should match.
However, if the metadata is built-in document metadata, you would probably need to re-index the data, since that extraction is done by Tika; or you could have a separate program that uses Tika independently to extract the document metadata and then update the document partially.
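A minimal sketch of such a partial (atomic) update, with placeholder field names, assuming the uniqueKey matches, an updateLog is configured, and the other fields are stored:

```python
import requests

CORE_URL = "http://localhost:8983/solr/documents"   # placeholder core URL

requests.post(CORE_URL + "/update", params={"commit": "true"}, json=[{
    "id": "doc-42",                        # must match the existing document's uniqueKey
    "author":   {"set": "Jane Smith"},     # overwrite a metadata field
    "tags":     {"add": "reviewed"},       # append to a multi-valued field
    "revision": {"inc": 1},                # increment a numeric field
}])
```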
I need a document-based database where I can update single documents, but when I retrieve them I'll get all documents at once.
In my app I have no need to search and there will be up to 100k documents.
The db should be hosted on a dedicated machine or VM and have a web api.
What is the best choice?
This is probably not so smart to do. If you only update one doc, why would you want to retrieve all the other 99,999 documents when you know they didn't change?
Unless you are making a computation based on that, in which case you probably want the DB to do that for you and only fetch the result?