Versioning document changes in Vespa - vespa

I would like to allow for versioning of text in Vespa. If a user changes certain fields over time the changes would be tracked and versions could be restored.
I imagine a solution running in parallel to Vespa would be the way to go, with version numbers being stored in the vespa doc as unindexed data.
Any recommendations on a solution to use to do this? Something like http://jsonpatch.com?

I would just store each version as a separate document by including the version in the document id.
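As a rough illustration of that idea, the sketch below builds Vespa document ids that carry a version suffix. Vespa ids follow the pattern `id:<namespace>:<document-type>:<key/value-pairs>:<user-specified>`; the `-v<N>` suffix, the function names, and the example namespace and document type are assumptions made for this example, not a Vespa convention.

```python
def versioned_doc_id(namespace: str, doctype: str, key: str, version: int) -> str:
    """Build a Vespa document id whose user-specified part embeds a version."""
    if version < 1:
        raise ValueError("version must be >= 1")
    # The "-v<N>" suffix is a hypothetical naming scheme chosen for this sketch.
    return f"id:{namespace}:{doctype}::{key}-v{version}"


def next_version_id(current_id: str) -> str:
    """Return the id for the version following the one encoded in current_id."""
    prefix, _, version = current_id.rpartition("-v")
    return f"{prefix}-v{int(version) + 1}"
```

Each edit would then be fed as a new document under the next id, and restoring an old version amounts to reading the document stored under the older id.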

Related

Is data importing automatic in solr?

I'm using Solr 4.10.4 with MySQL on Windows.
Solr recommends setting the HTTP cache setting max-age to half of how often the index gets updated.
So, the question is: does Solr automatically perform full/delta imports? If so, how do I control that?
If not, I assume it's up to me to update the index regularly?
#Howie
SOLR can be configured to pull data using a DataImportHandler
You should look at this documentation for details https://wiki.apache.org/solr/DataImportHandler
There is some documentation on scheduling the data pull, but it appears that it's not a configurable feature and requires some additional changes. The section http://wiki.apache.org/solr/DataImportHandler#Scheduling discusses this. There is also a Stack Overflow question on the same topic: "How can I Schedule data imports in Solr".
Alternatively, you can post data to SOLR as needed from your system.
Both strategies will work. It depends completely on what is better for your system. I would recommend going through the appropriate docs on indexing at https://wiki.apache.org/solr/FrontPage#Search_and_Indexing and then deciding which strategy works better for your application.
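Since the import has to be triggered from outside, one option is a small script run by an external scheduler. The sketch below only builds the request URL for a DataImportHandler command. It assumes the handler is registered at `/dataimport` in solrconfig.xml (the path is configurable), and the base URL and core name are placeholders.

```python
from urllib.parse import urlencode

# Commands understood by the DataImportHandler.
DIH_COMMANDS = {"full-import", "delta-import", "status", "reload-config", "abort"}


def dataimport_url(base_url: str, core: str,
                   command: str = "delta-import", commit: bool = True) -> str:
    """Build the URL that triggers a DataImportHandler command on one core."""
    if command not in DIH_COMMANDS:
        raise ValueError(f"unknown DIH command: {command}")
    params = {"command": command}
    if command in {"full-import", "delta-import"}:
        # Ask Solr to commit once the import finishes.
        params["commit"] = "true" if commit else "false"
    # "/dataimport" is an assumed handler path; check solrconfig.xml.
    return f"{base_url.rstrip('/')}/{core}/dataimport?{urlencode(params)}"
```

A scheduled job (cron, or Task Scheduler on Windows) could then fetch this URL at whatever interval suits the application, for example with `urllib.request.urlopen`.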

Solr denormalization and update of referenced data

Consider the following situation. We have a database which stores writers and books in two separate tables. One book obviously stores the reference to the writer who wrote the book.
For Solr I have to denormalize this structure into one big document where every book contains the details of the associated writer. This index is now used for querying books.
One user of the system now decides to update a writer record in the system. Because many books can be associated with it, I have to update every document in Solr that has embedded data from this writer record. This is very painful because, as far as I know, I have to delete and re-add every affected document.
Is there any better way of doing this? I need near realtime update of the index in the system if one of the referenced data gets modified.
This would be a perfect use case for nested documents. As far as I know, Lucene does support nested documents but Solr doesn't; I'm not totally sure about the current state of this feature.
This feature is available in elasticsearch though. You might want to have a look at it, there's an article I just wrote that can be interesting if you want to know what's so cool about elasticsearch in my opinion. Your question just reminded me that I didn't mention the nested documents feature in my article, which is really cool too. You can use the nested type in your mapping. If you want to know more you can have a look at this article. By the way it contains exactly the books/authors example.
Elasticsearch also helps you while updating documents. You don't need to reindex the whole document but send only the changes through a script. Thanks to the fact that it stores the source document that has been indexed it internally retrieves it, updates it running the script and reindexes it. That's how lucene internally works since its index segments are write-once. With Solr 4, which will be soon released, you can update documents providing only the changes, but as far as I know this works only if all your fields are stored. The fields that are not stored cannot be retrieved from the index.
If we are talking about Near Real Time updates, elasticsearch does use the Lucene Near Real Time API and automatically refreshes the index reader every second. Solr 3 doesn't yet use those APIs, but Solr 4 does.
For updating nested types in SOLR you can use dataimporters and delta imports. The example on https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example shows how this would work. Obviously you would then need to have solr access your database.
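Whichever mechanism pushes the changes, the work after a writer update comes down to rebuilding only the book documents that embed that writer. The following is a minimal sketch of that step in plain Python; the field names (`writer_id`, `writer_name`) and the in-memory tables are hypothetical stand-ins for the real schema and database.

```python
def denormalize(book: dict, writer: dict) -> dict:
    """Flatten one book and its writer into a single index document."""
    return {
        "id": f"book-{book['id']}",
        "title": book["title"],
        "writer_id": writer["id"],
        "writer_name": writer["name"],
    }


def docs_for_writer(books: list, writers: dict, writer_id: int) -> list:
    """Rebuild only the documents that embed data from the given writer."""
    writer = writers[writer_id]
    return [denormalize(b, writer) for b in books if b["writer_id"] == writer_id]
```

Re-adding a document with an existing unique key replaces the old copy, so the rebuilt documents can be posted as-is without a separate delete.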

Solr/SolrNet: How can I update a document given a document unique ID?

I need to update a few fields of each document in the Solr index separately from the main indexing process. According to the documentation, "Create" and "Update" are mapped onto the "Add()" function. http://code.google.com/p/solrnet/wiki/CRUD
So if I add a document which already exist, will it replace the entire document or just the fields that I have specified?
If it'll replace the entire document then the only way that I can think of in order to update is to search the document by unique id, update the document object and then "Add" it again. This doesn't sound feasible because of the frequency of update ops required. Is there a better way to update?
Thanks!
Unfortunately, Solr does not currently support updating individual fields for a given document in the index. The latter scenario you describe, retrieving the entire document contents (either from Solr or the original source) and then resending the document (adding via SolrNet), is the only way to update documents in Solr.
Please see the previous question: Update specific field on Solr index for more details about Solr not supporting individual field updates and an open JIRA issue for adding this support to Solr.
If you need to frequently update a lot of documents in SOLR, you might need to rethink your entire solution. In typical solutions that use SOLR and require lots of frequent updates to documents, the way it is usually done is that the documents reside in some SQL or NoSQL database, and they are modified there. Then you use DIH or something similar to bulk update the SOLR index from the database, possibly just dropping the index and re-indexing all content. SOLR can index documents very quickly so that is typically not a problem.
Partial updating of documents is now supported in the newer versions of Solr, for example 4.10 does pretty well. Please look at the following page for more information:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
The only detail is that you need to declare your fields as stored=true to allow for partial updates.
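For illustration, a partial (atomic) update is sent as a normal update request in which each changed field is wrapped in a modifier such as `set`. The sketch below only builds the JSON body. The helper name and the example field names are made up for this example, and `id` is assumed to be the schema's unique key.

```python
import json


def atomic_update(doc_id: str, changes: dict, id_field: str = "id") -> str:
    """Build a JSON body that sets the given fields on a single document."""
    if id_field in changes:
        raise ValueError("the unique key cannot be modified")
    doc = {id_field: doc_id}
    for field, value in changes.items():
        # {"set": value} replaces the field's current value.
        doc[field] = {"set": value}
    return json.dumps([doc])
```

The resulting body would be POSTed to the core's update handler with a `Content-Type: application/json` header. Fields that are not mentioned keep their values only if they are stored, which is the caveat described above.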
I also show how to do it in this training:
http://www.pluralsight.com/courses/enterprise-search-using-apache-solr
In this specific module: Content: Schemas, Documents and Indexing

Database for storing large documents

Can anyone suggest a database solution for storing large documents which will have multiple branched revisions? Partial edits of content should be possible without having to update the entire document.
I was looking at XML databases and wondering about the suitability of them, or maybe even using a DVCS (like Mercurial).
It should preferably have Python bindings.
Try Fossil -- it has a good delta encoding algorithm, and keeps all versions. It's backed by a single SQLite database, and has both a web based and a command line UI.
This depends on your storage behavior and use case. If you plan to store a massive number of "document revisions" and keep historical versions, and can comply with a write-once-read-many pattern, you should look into something like Hadoop HDFS. This requires a lot of (cheap) infrastructure to run your cluster, but you will be able to keep adding revisions/data over time and will be able to quickly look it up using a MapReduce algorithm.

Is there a project that integrates CouchDb and Solr?

I would like to be able to search a CouchDB database using Solr. Are there any projects that provide such an integration?
I am also aware of CouchDB-Lucene. Is there a way to hook Solr into that?
Thanks!
It would make more sense to roll your own, given how easy it is. First you need to decide what kind of SOLR schema to use and how to map your CouchDB documents onto that schema. Then simply iterate through all the documents in a db (see "Pagination in CouchDB?") and generate SOLR <add> documents.
People do this all the time with all kinds of data sources. Since SOLR is essentially searching a single table, the hard work is often figuring out how to map your database format onto a single table. Read up on what you can do with the SOLR schema, and you may be surprised at how easy this is.
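A minimal sketch of that mapping step is shown below. It assumes flat CouchDB documents whose `_id` becomes the Solr `id` field, drops other underscore-prefixed metadata such as `_rev`, and turns list values into repeated fields. The function name and the mapping rules are choices made for this example, and fetching the documents from CouchDB is left out.

```python
import xml.etree.ElementTree as ET


def couch_docs_to_solr_add(docs: list) -> str:
    """Convert flat CouchDB documents into one Solr <add> XML message."""
    add = ET.Element("add")
    for couch_doc in docs:
        doc = ET.SubElement(add, "doc")
        for key, value in couch_doc.items():
            if key == "_id":
                name = "id"  # assumed unique key field in the Solr schema
            elif key.startswith("_"):
                continue  # skip CouchDB metadata such as _rev
            else:
                name = key
            # A list becomes one <field> element per value (multi-valued field).
            values = value if isinstance(value, list) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")
```

Nested objects would need their own flattening rule, which is where most of the schema-mapping effort mentioned above tends to go.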
There is a CouchDB integration for ElasticSearch available, apart from feeding ElasticSearch with JSON on your own. Both work with schema-less JSON, so it's very easy to integrate them.
In terms of features, ElasticSearch would offer a comparable set to Solr (in addition to some unique features, of course.)
According to this
http://wiki.apache.org/couchdb/Related_Projects
there was a CouchDB-Solr2 project (scroll down to the end), which is no longer maintained.
