I have been setting Solr up to automatically generate IDs for my documents by following this guide: https://wiki.apache.org/solr/UniqueKey, and it is working as intended.
Now, when inserting a document, I would like to ensure that the url field (just a string) is unique across all documents in the index. Whenever a new document is added, it should simply update the existing document if a document already exists with that particular url.
The unique id is used to identify a document in another part of the system.
I have tried declaring the url field as unique in the schema, but that is just ignored, and it is thus still possible to add a document with a non-unique url.
I'm using Solr 4.10.2.
Any help is greatly appreciated!
You could prevent duplicates from entering the index by using the "De-Duplication" Solr feature. Please have a look at the wiki for configuration and more details: https://cwiki.apache.org/confluence/display/solr/De-Duplication
There is also a flag, overwriteDupes, which I believe issues an "update" that overwrites the old values, although it is not clearly documented in the wiki.
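For reference, a minimal sketch of such a chain in solrconfig.xml, assuming the signature should be computed from the url field (the chain name and the signature field are illustrative, and the signature field must also be declared in schema.xml):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that receives the computed signature; must exist in schema.xml -->
    <str name="signatureField">signature</str>
    <!-- delete any existing document with the same signature before adding the new one -->
    <bool name="overwriteDupes">true</bool>
    <!-- field(s) the signature is computed from -->
    <str name="fields">url</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The chain then has to be wired into the update handler, typically by setting update.chain to dedupe in the /update handler's defaults.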
My goal is to create a single searchable Azure index that has all of the relevant information currently stored in many different SQL tables.
I'm also using an Azure Cognitive Service to add additional info from related documents. Each document is tied to only a single item in my Index, but each item in the index will be tied to many documents.
According to my understanding, if two documents have the same value for the indexer's Key, then the index will overwrite the extracted information from the first document with the information extracted from the second. I'm hoping there's a way to append the information instead of overwriting it. For example: if two documents relate to the same index item, I want the values mapped to keyphrases for that item to include the keyphrases found in the first document and the keyphrases found in the second document.
Is this possible? Is there a different way I should be approaching this?
If it is possible, can I do it without having duplicate values?
Currently I have multiple indexes and I'm combining the search results from each one, but this seems inefficient and likely messes up the default scoring algorithm.
Every code example I find only has one document for each index item and doesn't address my problem. Admittedly, I haven't tried to set up my index as described above, because it would take a lot of refactoring, and I'm confident it would just overwrite itself.
I am currently creating my indexes and indexers programmatically using dotnet. I'm assuming my code isn't relevant to my question, but I can provide it if need be.
Thank you so much! I'd appreciate any feedback you can give.
Edit: I'm thinking about creating a custom skill to do the aggregation for me, but I don't know how the skill would access everything it needs. It needs the extracted info from the current document, and it needs the previously aggregated info from earlier documents. I guess the custom skill could perform a search on the index and get the item that way, but that sounds dangerously hacky. Any thoughts would be appreciated.
Pasting from docs:
Indexing actions: upload, merge, mergeOrUpload, delete
You can control the type of indexing action on a per-document basis, specifying whether the document should be uploaded in full, merged with existing document content, or deleted.
Whether you use the REST API or an SDK, the following document operations are supported for data import:
upload is similar to an "upsert", where the document is inserted if it is new, and updated or replaced if it exists. If the document is missing values that the index requires, the document field's value is set to null.
merge updates a document that already exists, and fails a document that cannot be found. Merge replaces existing values. For this reason, be sure to check for collection fields that contain multiple values, such as fields of type Collection(Edm.String). For example, if a tags field starts with a value of ["budget"] and you execute a merge with ["economy", "pool"], the final value of the tags field is ["economy", "pool"]. It won't be ["budget", "economy", "pool"].
mergeOrUpload behaves like merge if the document exists, and upload if the document is new.
delete removes the entire document from the index. If you want to remove an individual field, use merge instead, setting the field in question to null.
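For example, a sketch of the corresponding REST call using mergeOrUpload (the service name, index name, api-version, key field and values are placeholders):

POST https://[service-name].search.windows.net/indexes/[index-name]/docs/index?api-version=[api-version]
Content-Type: application/json
api-key: [admin-key]

{
  "value": [
    {
      "@search.action": "mergeOrUpload",
      "id": "1",
      "tags": ["economy", "pool"]
    }
  ]
}

Per the merge semantics above, the tags collection sent here replaces whatever the document previously had, so appending to a collection means reading the existing values first and sending the combined list.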
I want to remove one specific value from a multivalued field in a large index, where I need to query first which documents contain that value, i.e.:
retrieve IDs of the documents containing the specific value.
partially update these documents (using remove).
Solr version is 5.1. I could update if necessary, but the change logs do not indicate any relevance to this issue.
I've tried the following query (in a few variations) on the /select endpoint through the Solr web interface (http://localhost:8983/solr/#/core/documents), trying to remove the value from all the documents:
{"id":"*", "field": {"remove":"value"} }
The server response is "success", but no document is updated.
What I could do is query for field:value, extract the document IDs, and (programmatically) generate update requests for those IDs, similar to what has been indicated in this answer. But I would expect there to be a more straightforward solution.
The examples presented in the partial updates documentation and other related web pages are not really applicable here, because they assume that the IDs of the updated documents are known in advance.
Most other discussions about similar issues refer to old Solr versions, before partial updates were introduced (in Solr 4).
As far as I know, there is no "update by query" functionality in Solr at the moment, so fetching and then updating is still the suggested way.
Batching these updates (one select, then one update request for all affected documents) should however work as expected, reducing the number of requests made to Solr.
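A minimal sketch of that two-step flow (core, field and value names are illustrative; page with start/rows if there are more matches):

GET http://localhost:8983/solr/core/select?q=field:value&fl=id&rows=1000&wt=json

Then one batched atomic update for all returned IDs:

POST http://localhost:8983/solr/core/update?commit=true
Content-Type: application/json

[
  {"id": "doc1", "field": {"remove": "value"}},
  {"id": "doc2", "field": {"remove": "value"}}
]

Note that atomic updates require the fields to be stored so Solr can reconstruct the rest of the document.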
We are using Solr to store items that have been received and ingested through another service.
I am currently looking into a task to avoid duplicate items being created with the same id.
I am not an expert in Solr and am trying to pick up the task from someone who has left the company. The last suggestion about how to prevent duplication mentioned that it should be possible using a combination of defining a uniqueKey on the id field and using an UpdateRequestProcessor chain. I don't know enough about the UpdateRequestProcessor chain to know what approach they had in mind. I do know the ultimate goal: when an item is sent to Solr with the same id as an existing item, an update should be performed rather than a create.
I have looked at the Solr documentation about the UpdateRequestProcessor chain. Without more background information, those resources have not helped much so far. I think I would benefit from Solr experts helping me get started or pointing me in the right direction.
You don't need to get a URP chain involved. It is much simpler than that. If your doc's id (defined in schema.xml as <uniqueKey>id</uniqueKey>) is already a unique id, then you don't need to do anything else. Indexing the same doc with the same id twice will update it the second time (a delete and a new insert under the hood).
If your uniqueKey is not the unique id, then just rework the schema (and the app using Solr, if it needs to) so they match.
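A minimal schema.xml sketch of that setup (the field name just has to match the uniqueKey declaration):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<uniqueKey>id</uniqueKey>

With this in place, sending a document whose id already exists simply overwrites the previous version.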
We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
Solr 4 does allow this, but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that Solr can achieve this? We've tried Solr JOINs in the past, but they weren't the right fit for all our use cases.
2. Alternatively, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents when they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: it never goes back to modify existing ones, so it only marks documents as deleted and deletes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that is able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source documents by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that makes Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
1) Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge).
2) Executing a script on the existing document and indexing the result after deleting the old one.
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
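A sketch of both styles against the Update API of that era of Elasticsearch (index, type, document id and field names are illustrative; scripted updates may require scripting to be enabled, depending on the version):

POST /myindex/mytype/1/_update
{
  "doc": { "tags": ["budget", "economy"] }
}

POST /myindex/mytype/1/_update
{
  "script": "ctx._source.views += 1"
}

Under the hood, both forms fetch the stored _source, apply the change, and reindex the whole resulting document.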
I have an application that contains a set of text documents that users can search for. Every user must be able to search based on the text of the documents. What is more, users must be able to define custom tags and associate them to a document. Those tags are used in two ways:
1) Users must be able to search for documents based on specific tag ids.
2) There must be facets available for the tags.
My solution was adding a multivalued field to each document to pose as an array that contains the tag ids that the document has been tagged with. So far so good. I was able to perform queries based on text and tag ids (for example text:hi AND tagIds:56).
My question is, would that solution work in production in an environment where users add but also remove tags from the documents? Remember, I have to have the data available in real time, so whenever a user removes/adds a tag I have to reindex that document and commit immediately. If that's not a good solution, what would be an alternative?
Stack Overflow uses Solr, in case you doubt Solr's abilities in production mode.
And although I couldn't find much information on how they have implemented tags, I don't think your approach sounds wrong. Yes, tagged documents will have to be reindexed (which means a slight delay), but other than that I don't see anything wrong with it.
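If the relevant fields are stored (Solr 4+), the tag change does not even require resending the whole document; an atomic update plus a commit is enough. A sketch (core name, field name and values are illustrative; the remove operation only exists in more recent releases):

POST http://localhost:8983/solr/core/update?commit=true
Content-Type: application/json

[
  {"id": "doc42", "tagIds": {"add": 56}},
  {"id": "doc42", "tagIds": {"remove": 13}}
]

Hard-committing on every tag change can get expensive; if "real time" tolerates a short delay, commitWithin or soft commits are a cheaper way to make the change visible.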