We have a requirement that documents we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be:
a. adding new fields
b. updating the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this, but all the fields must be stored. See the question "Update a new field to existing document" and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past but it wasn't the right fit for all our use cases.
On the other hand, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: Lucene never goes back to modify an existing segment, it only marks documents as deleted and removes them for real when a segment merge happens.
Search servers built on top of Lucene try to work around this problem by exposing a single endpoint that can delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
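For reference, this is roughly what a Solr 4 atomic update request looks like when every field is stored (the collection name, document id and field names below are only illustrative):
curl 'http://localhost:8983/solr/collection1/update?commit=true' -H 'Content-Type: application/json' -d '[{"id": "doc-1", "price": {"set": 99.95}, "tags": {"add": "sale"}}]'
Under the hood Solr still rebuilds the whole document from its stored fields, applies the changes, deletes the old version and indexes the result.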
Elasticsearch works around it storing the source documents by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is by the way one of the features that make elasticsearch similar to NoSQL databases. The elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
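As a rough sketch (index name, type, document id and fields below are made up, and the exact form depends on your Elasticsearch version), the two flavours of the Update API look like this:
Partial document merge: curl -XPOST 'http://localhost:9200/blog/post/1/_update' -d '{"doc": {"title": "Updated title"}}'
Script-based update: curl -XPOST 'http://localhost:9200/blog/post/1/_update' -d '{"script": "ctx._source.views += 1"}'
In both cases Elasticsearch reads the stored _source, applies the change and reindexes the resulting document.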
Related
My goal is to create a single searchable Azure Index that has all of the relevant information currently stored in many different sql tables.
I'm also using an Azure Cognitive Service to add additional info from related documents. Each document is tied to only a single item in my Index, but each item in the index will be tied to many documents.
According to my understanding, if two documents have the same value for the indexer's Key, then the index will overwrite the extracted information from the first document with the information extracted from the second. I'm hoping there's a way to append the information instead of overwriting it. For example: if two documents relate to the same index item, I want the values mapped to keyphrases for that item to include the keyphrases found in the first document and the keyphrases found in the second document.
Is this possible? Is there a different way I should be approaching this?
If it is possible, can I do it without having duplicate values?
Currently I have multiple indexes and I'm combining the search results from each one, but this seems inefficient and likely messes up the default scoring algorithm.
Every code example I find only has one document for each index item and doesn't address my problem. Admittedly, I haven't tried to set up my index as described above, because it would take a lot of refactoring, and I'm confident it would just overwrite itself.
I am currently creating my indexes and indexers programmatically using dotnet. I'm assuming my code isn't relevant to my question, but I can provide it if need be.
Thank you so much! I'd appreciate any feedback you can give.
Edit: I'm thinking about creating a custom skill to do the aggregation for me, but I don't know how the skill would access everything it needs. It needs the extracted info from the current document, and it needs the previously aggregated info from previous documents. I guess the custom skill could perform a search on the index and get the item that way, but that sounds dangerously hacky. Any thoughts would be appreciated.
Pasting from docs:
Indexing actions: upload, merge, mergeOrUpload, delete
You can control the type of indexing action on a per-document basis, specifying whether the document should be uploaded in full, merged with existing document content, or deleted.
Whether you use the REST API or an SDK, the following document operations are supported for data import:
Upload, similar to an "upsert" where the document is inserted if it is new, and updated or replaced if it exists. If the document is missing values that the index requires, the document field's value is set to null.
merge updates a document that already exists, and fails a document that cannot be found. Merge replaces existing values. For this reason, be sure to check for collection fields that contain multiple values, such as fields of type Collection(Edm.String). For example, if a tags field starts with a value of ["budget"] and you execute a merge with ["economy", "pool"], the final value of the tags field is ["economy", "pool"]. It won't be ["budget", "economy", "pool"].
mergeOrUpload behaves like merge if the document exists, and upload if the document is new.
delete removes the entire document from the index. If you want to remove an individual field, use merge instead, setting the field in question to null.
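To illustrate (the service name, index name, key field and keyphrases field below are placeholders, not taken from your setup), a mergeOrUpload call against the documents index API looks roughly like this; because merge replaces collection values, the request has to carry the already-combined list of keyphrases rather than only the new ones:
curl -X POST 'https://<service>.search.windows.net/indexes/<index>/docs/index?api-version=2020-06-30' -H 'Content-Type: application/json' -H 'api-key: <admin-key>' -d '{"value": [{"@search.action": "mergeOrUpload", "id": "item-1", "keyphrases": ["budget", "economy", "pool"]}]}'
So to append rather than overwrite, something (for example a custom skill, or the code that pushes the documents) would first have to read the existing item, union the keyphrases, and then send the merged list.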
I have roughly 50M documents and 90 fields (20 stored + 70 non-stored) in schema.xml, indexed in a single core. The queries are quite complex, with faceting and highlighting. Out of these 90 fields, there are 3-4 fields (all stored) which are updated very frequently. Updating these fields normally would require populating all the fields again, which is a heavy task. If I use atomic/partial updates, we have to supply the non-stored fields again.
Our Solution:
To overcome the above problems, we decided to use SolrCloud and join queries. We split the index into two separate indexes/collections, i.e. one for stored fields and one for non-stored fields, with the relation between the documents being the id of the doc. We kept the frequently updated fields in the stored collection. By doing this we were able to leverage atomic updates. Also, to overcome the limitation of join queries in SolrCloud, we sharded & replicated the stored collection across all nodes, while the non-stored collection was not sharded but was replicated across all nodes. We have a 5-node cluster with an additional 3 instances of ZooKeeper. Considering the number of docs, the only area of concern is whether join queries will eventually degrade search performance. If so, what other options can I consider?
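For reference, a cross-collection join in that layout would look roughly like this (collection and field names below are illustrative): the query runs against the sharded stored collection and is filtered through the single-shard, fully replicated non-stored collection, e.g.
http://localhost:8983/solr/stored_fields/select?q=*:*&fq={!join from=id to=id fromIndex=non_stored_fields}body:"some text"&facet=true
Note that the join only filters: fields from the fromIndex collection cannot be returned, faceted or highlighted in the same request, which is part of why the answer below suggests rethinking the design.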
Using joins makes Solr more like a relational database. I found an article on this from the Lucidworks team, Solr and Joins. Even they say that if your solution requires joins, you need to rethink it.
I think I have a solution for you. First of all, forget two collections. You create one collection, and you are going to have two Solr documents for every single logical document: one document holds the stored fields and the other holds the non-stored fields. At update time you touch only the document with the stored fields, and you perform the search-related operations on the other document.
Then all you need to do at query time is merge both documents into a single result, which can be done by writing a service layer over Solr.
I have an issue where partial/atomic updates trigger index operations in the background on fields I did not modify. This is different from the question, but maybe the use of nested documents is worth thinking about.
I was checking the use of nested documents to separate document header data from the text content to be indexed, since processing the text content consumes a lot of resources. According to the docs, parents and children are indexed as blocks and always have to be indexed together.
This is stated in https://solr.apache.org/guide/8_0/indexing-nested-documents.html:
With the exception of in-place updates, the whole block must be updated or deleted together, not separately. For some applications this may result in tons of extra indexing and thus may be a deal-breaker.
So as long as you are not able to perform in-place updates (which have their own restrictions in terms of indexed, stored and <copyField...> directives), the use of nested documents does not seem to be a valid approach.
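For completeness, a minimal sketch of what an in-place update requires (the field, collection and document names here are made up): the field must be single-valued, have docValues="true", be neither indexed nor stored, and not be the source or destination of a copyField, e.g.
<field name="popularity" type="pfloat" indexed="false" stored="false" docValues="true"/>
curl 'http://localhost:8983/solr/mycollection/update?commit=true' -H 'Content-Type: application/json' -d '[{"id": "parent-1", "popularity": {"inc": 1}}]'
If those conditions hold, Solr rewrites only the docValues for that field and the rest of the block is left alone.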
I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in those fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside to this is though, that when e.g. a user updates their name, I have to run through all blog posts of them and update the documents for that in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations about how to handle this use case? Maybe my Google foo is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The absolutely most performant and easiest solution would be to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than the documents are updated. And even if one of the values that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and then going backwards will appear rather performant for most users. This assumes that you actually have to search on the values and don't just need to display them - in that case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
Another option is to divide this into a multi-search problem: one collection for blog posts, one collection for users and one collection for categories. You then search through each of the collections for the relevant data and merge it in your search model. You can also use Streaming Expressions to hand off most of this processing to a Solr cluster, as in the sketch below.
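A sketch of the streaming-expression route (collection names, fields and the join key are illustrative; both streams must be sorted on the join key, and the fields requested through /export need docValues):
innerJoin(
  search(blogs, qt="/export", q="content:solr", fl="id,title,user_id", sort="user_id asc"),
  search(users, qt="/export", q="*:*", fl="user_id,name", sort="user_id asc"),
  on="user_id")
The expression is posted to the /stream handler of one of the collections, which streams back blog documents decorated with the matching user fields.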
The reason why I always recommend flattening if possible is that most features in Solr (and Lucene) are written for a flat document structure and let you fully leverage what's available. Since Lucene is by design a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if that's possible at all). If the documents are flat, it just works.
I am doing a POC on content/text search using Solr 3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, users can change the custom metadata of the documents. However, once a document has been added to the index, its content cannot be updated. When the user updates the custom metadata, the document index has to be updated to reflect the metadata changes in search.
But during the index update, even though the content of the file has not changed, it is indexed again, which causes delays in the metadata update.
So I wanted to check if there is a way to avoid content indexing and update just the metadata?
Or do I have to store the content and metadata in separate index files, i.e. documentId + content in index1 and documentId + custom metadata in another index? In that case, how can I query these two different indexes and return a combined result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two Solr indexes are considered two databases, not two tables, if you had DB-like joins in mind. I just answered on this in How to start and Stop SOLR from A user created windows service.
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata AND year:1984
Note the type:xx.
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
We did try this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Compress and serialize the object, and then assign it to one more field in your schema. Make that field a binary field.
So when you want to update one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed it to Lucene.
Don't forget to also store, as one of the fields inside the SolrInputDocument, the XML that contains the text extracted by Tika, which is used for search/indexing.
The only negative: Your index size will grow a little bit but you will get what you want without re-feeding the data.
I am going to change some field types in the schema, so it seems all the docs currently in the Solr index must be re-indexed for this kind of change.
The question is about how to "re-index" all the docs?
One solution that I can think of is to "query" all docs through the search interface and dump a large file in XML or JSON, then convert it to the input XML format for Solr, and load it back to Solr again to make the schema change happen.
Is there a better, more efficient way to do this? Thanks for your suggestions.
First of all, dumping the results of a query may not give you the original data if you have fields that are indexed and not stored. In general, it is best to keep a copy of the input to SOLR in a form that you can easily use to rebuild indexes from scratch if you need to. In that case, just run a delete query by posting <delete><query>*:*</query></delete> then <commit/> and then <optimize/>. After that your index is empty and you can add new documents that use the new schema.
But you may be able to get away with just running <optimize/> after you restart SOLR with the new schema file. It would be good to have a backup where you can test that it works for your configuration.
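If it helps, the delete/commit/optimize sequence described above can be posted from the command line like this (assuming a local single-core setup; adjust host, port and core path to your installation):
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' -d '<delete><query>*:*</query></delete>'
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' -d '<commit/>'
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' -d '<optimize/>'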
There is a tool called Luke that can be used to browse and export Lucene indexes. I have never tried it myself, but it might be able to help you export your data so that you can reimport it.
The idea of dumping all the results of a query could give you incomplete or invalid data since you might not surface all of the data within your index.
While the idea of keeping a copy of your index in a form in which you can re-insert it would work well in a situation where the data doesn't change, it becomes more complicated when you've added a new field to the schema. In such a situation, you'll need to collect all the data from the source, format the data to match the new schema and then insert it.
If the number of documents in Solr is large and you need to keep the Solr server available for querying, the indexing job can be started in the background to re-add/re-index the documents.
It is helpful to introduce a new field that keeps the last-indexed timestamp for each document, so in case of any indexing/re-indexing issues it is possible to identify which documents are still waiting to be reindexed.
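A sketch of how that could be queried (the field name and the cutoff are made up; the cutoff would be the time the reindexing job started): any document whose last-indexed timestamp predates the cutoff is still waiting, e.g.
http://localhost:8983/solr/collection1/select?q=*:*&fq=last_indexed_dt:[* TO 2020-01-01T00:00:00Z]&sort=last_indexed_dt asc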
To improve query latency, it is possible to play with configuration parameters so the caches are kept warm after every commit.
There is a PHP script that does exactly this: fetch and reinsert all your Solr documents, reindexing them.
For optimizing, call from command line:
curl http://<solr_host>:<port>/solr/<core_name>/update -F stream.body=' <optimize />'