Solr: remove value with partial update by query

I want to remove one specific value from a multivalued field in a large index, where I need to query first which documents contain that value, i.e.:
retrieve IDs of the documents containing the specific value.
partially update these documents (using remove).
Solr version is 5.1. I could upgrade if necessary, but the change logs do not indicate any relevance to this issue.
I've tried the following query (in a few variations) on the /select endpoint through the Solr web interface (http://localhost:8983/solr/#/core/documents), trying to remove the value from all the documents:
{"id":"*", "field": {"remove":"value"} }
The server response is "success", but no document is updated.
What I could do is query for field:value, extract the document IDs, and (programmatically) generate update requests for these IDs, similar to what has been indicated in this answer. But I would expect there to be a more straightforward solution.
The examples presented in the partial updates documentation and other related web pages are not really applicable here, because they assume that the IDs of the updated documents are known in advance.
Most other discussions about similar issues refer to old Solr versions, before partial updates were introduced (in Solr 4).

As far as I know, Solr currently offers no "update by query" functionality, so fetching the IDs and then updating is still the suggested approach.
Batching these updates (one select, one update) should, however, work as expected, reducing the number of requests made to Solr.
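A minimal sketch of that batched approach, using only the standard library to build the update payload (the field name, value, and document IDs below are made-up examples, not from a real index):

```python
import json

def build_remove_updates(doc_ids, field, value):
    """One atomic-update document per id, removing `value` from
    the multivalued `field`."""
    return [{"id": doc_id, field: {"remove": value}} for doc_id in doc_ids]

# doc_ids would come from a prior select such as q=tags:obsolete&fl=id&rows=...
payload = json.dumps(build_remove_updates(["doc1", "doc2"], "tags", "obsolete"))
# POST this payload to /solr/<core>/update?commit=true
# with Content-Type: application/json
```

The single select can page through all matches (or use a large `rows` value), and the whole list of atomic updates goes out in one POST.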

Related

When is Luke data distributed across Solr cores?

On a Solr installation with 2+ shards, when is the data returned by the LukeRequestHandler distributed across the shards? I ask because I want to be able to detect new (previously unseen) dynamic fields within a short amount of time after they are added.
Example desired sequence of events:
Assume dynamic field *_s
Query Luke and receive list of dynamic fields
Add document with field example_s
Query Luke and receive same list as before but with additional example_s in result (this currently doesn't happen)
Query collection for example_s:* and match the document added above
I am aware that newly added documents become immediately searchable even before being hard committed, but I am looking for a way to have that info appear in Luke too.
Info on the following would be useful:
Does Luke query all shards at request time, or just one? It would appear to only query one at random.
Exactly when does knowledge of previously unseen dynamic fields become distributed across all shards (equivalently, available to Luke)?
Can I configure the delay/trigger for this supposed Luke propagation in order to minimize the delay between addition of a document with a new dynamic field on an arbitrary shard and the moment it becomes visible in Luke responses on every other shard?
See https://issues.apache.org/jira/browse/SOLR-8127
Never.
As indicated by responses on the linked ticket, the Luke request handler isn't at a high enough level to understand multiple shards. Luke provides information about an index, not a collection, and certainly not a cluster.
You need to query each shard directly. This can be done by using the exact core path, e.g. /solr/collection_shard1_replica1/admin/luke
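A small sketch of querying every shard core in turn. This assumes the default SolrCloud core naming convention `<collection>_shard<N>_replica<M>`; adjust to your actual core names (the collection name and shard count below are hypothetical):

```python
def luke_urls(base_url, collection, num_shards, replica=1):
    """Build one Luke handler URL per shard core, assuming default
    core naming: <collection>_shard<N>_replica<M>."""
    return [
        f"{base_url}/solr/{collection}_shard{shard}_replica{replica}/admin/luke"
        for shard in range(1, num_shards + 1)
    ]

urls = luke_urls("http://localhost:8983", "mycollection", 2)
# Fetch each URL and merge the per-shard field lists yourself.
```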

Identifying documents by multiple unique keys in solr

I have been setting SOLR up to automatically generate IDs for my documents by following this guide:
https://wiki.apache.org/solr/UniqueKey, which is working as intended.
Now, when inserting a document, I would like to check/ensure that the url field (just a string) is unique across all documents in the index. So whenever a new document is added, it should just update any existing document if a document with that particular url already exists.
The unique id is used to identify a document in another part of the system.
I have tried declaring url as an additional unique key, but it is just ignored, so it is still possible to add a document with a non-unique url.
I'm using SOLR 4.10.2.
Any help is greatly appreciated!
You could prevent duplicates from entering the index by using the "De-duplication" Solr feature. Please have a look at the wiki for configuration and more details: https://cwiki.apache.org/confluence/display/solr/De-Duplication
There is also an "overwriteDupes" flag which, I believe, issues an "update" command that overwrites the old values, although it is not clearly documented in the wiki.
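A minimal configuration, adapted from the De-Duplication documentation, might look like this in solrconfig.xml (using the url field as the deduplication source is this question's scenario; the chain name is arbitrary):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">url</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be referenced from your update handler (e.g. via the update.chain parameter) so that documents with the same url hash to the same signature and overwrite each other.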

Partial Update of documents

We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past, but they weren't the right fit for all our use cases.
On the other hand, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has the stored-fields limitation, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. Lucene segments are in fact write-once: Lucene never goes back to modify existing ones, so it only marks documents as deleted and removes them for real when a merge happens.
Search servers built on top of Lucene try to work around this problem by exposing a single endpoint that deletes the old document and reindexes the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source document by default, in a special field called _source. That is exactly the document you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that is merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
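The two request-body shapes can be sketched as follows. This reflects the nested "source"/"params" script form of more recent Elasticsearch versions (older versions took the script as a plain string); the field names and script are illustrative:

```python
def merge_update(fields):
    # Partial document that Elasticsearch merges into the stored _source
    return {"doc": fields}

def script_update(source, params=None):
    # Script executed against the existing document (via ctx._source)
    return {"script": {"source": source, "params": params or {}}}

merge_body = merge_update({"views": 10})
script_body = script_update("ctx._source.views += params.n", {"n": 1})
# Either body is POSTed to /<index>/_update/<id>
```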

Sorting by recent access in Lucene / Solr

In my Solr queries, I want to sort the most recently accessed documents to the top ("accessed" meaning opened by user action). No other search criterion carries weight for me: of the documents with text matching the query, I want them in order of recent use. I can only think of two ways to do this:
1) Include a 'last accessed' date field in each doc to have Solr sort upon. Trie Date fields can be sorted very quickly, I'm told. The problem of course is keeping the field up to date, which would require storing each document's text so I can delete and re-add any document with an updated 'last accessed' field. Mutable fields would obviate this, but Lucene/Solr still doesn't offer mutable fields.
2) Alternatively, store the mutable 'last accessed' dates and keep them updated in another db. This would require Solr to return the full list of matching documents, which could be upwards of hundreds of thousands of documents. This huge list of document ids would then be matched up against dates in the db and then sorted. It would work OK for uncommon search terms, but not for broad, common search terms.
So the trade-off is between 1) index size plus a processing cost every time a document is accessed, and 2) big query overhead, especially for unfocused search terms.
Do I have any alternatives?
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles
http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html
You should be able to do this with the atomic update functionality.
http://wiki.apache.org/solr/Atomic_Updates
This functionality is available as of Solr 4.0. It allows you to update a single field in a document without having to reindex the entire document. I only know about this functionality from the documentation. I have not used it myself, so I can't say how well it works or if there are any pitfalls.
Definitely use option 1, using SOLR queries and updating the lastAccessed field as needed.
Since SOLR 4.0, partial document updates are supported in several flavours: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
For your application it seems that a simple atomic update would be sufficient.
With respect to performance, this should work very well for large collections and fast document updates.
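The per-access update then reduces to a tiny atomic "set" on a single field. A sketch (the id and timestamp are made up for illustration; all other fields in the schema must be stored so Solr can reconstruct the document):

```python
import json

def touch_document(doc_id, accessed_at):
    """Atomic update that sets only the lastAccessed field,
    leaving every other field untouched."""
    return {"id": doc_id, "lastAccessed": {"set": accessed_at}}

payload = json.dumps([touch_document("doc42", "2015-06-01T12:00:00Z")])
# POST to /solr/<core>/update with Content-Type: application/json;
# then sort queries with sort=lastAccessed desc
```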

Solr - retrieving facet counts for unfiltered version of query

I'm using Solr for searching, and recently started using faceting to allow users to narrow their search. However, once the user filters by one of the facets, the other filter options are no longer returned in the facet results. This is expected, but not what I'd like.
Is there some way to return the facet fields and counts for the unfiltered query, without doing an extra search? For instance, if the user filters by category (by selecting a specific category), I'd like them to still be able to pick one of the other categories without having to explicitly remove the filter first. (That is, all of the categories—and their counts—should still be returned by Solr, so that I can include them on the page along with the filtered query set.)
I suspect this may not be possible. If it isn't I can just do an extra query per search, which would leave out the filter (and return 0 rows), as described in a previous StackOverflow question. But I thought I'd ask: does anyone know a way to do this without multiple queries?
This is called multi-select faceting and it is possible using specific LocalParams to exclude filters when faceting. See "Tagging and excluding Filters" for details.
Here is an SO answer that also explains this, with an example provided:
SolrNet : Keep Facet count when filtering query,
and here is a fresh SOLR documentation URL, since the URLs in both this and the linked SO answers are outdated now:
https://solr.apache.org/guide/8_11/faceting.html#tagging-and-excluding-filters
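The mechanism boils down to tagging the filter query and excluding that tag when faceting. Illustrative query parameters (the field and values are hypothetical), shown here as a Python dict for clarity:

```python
# Multi-select faceting: the user's category filter is tagged "cat",
# and the category facet excludes that tag, so counts for all
# categories are still returned despite the active filter.
params = {
    "q": "shoes",                        # hypothetical search term
    "fq": "{!tag=cat}category:boots",    # user's selected facet filter
    "facet": "true",
    "facet.field": "{!ex=cat}category",  # counts ignore the tagged filter
}
```

Any number of tags can be used, one per filterable field, so each facet can exclude only its own filter while still applying the others.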
