Solr document disappears when I update it

I am trying to update existing documents in a (Sentry-secured) Solr collection. The updates are accepted by Solr, but when I query, the document seems to have disappeared from the collection.
What is going on?
I am using Cloudera (CDH) 5.8.3, and Sentry with document-level access control enabled.

When using document-level access control, Sentry uses a field (whose name is defined in solrconfig.secure.xml, but the default is sentry_auth) to determine which roles can see that document.
If you update a document, but forget to supply a sentry_auth field, then the updated document doesn't belong to any roles, so nobody can see it - it becomes essentially invisible! This is easily done, because the sentry_auth field is typically not a stored field, so won't be returned by any queries.
You therefore cannot simply retrieve a document, modify a field, and update it - you need to know which roles the document belongs to, so you can supply a properly populated sentry_auth field.
You can make the sentry_auth field a required field in the Solr schema, which will prevent you from accidentally omitting it.
However, this won't prevent you from supplying a blank sentry_auth field (or supplying incorrect roles), either of which will also make the document "disappear".
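A minimal sketch of an update that keeps the document visible, assuming the default sentry_auth field name; the document ID, the updated field, and the role names are all hypothetical examples:

```python
import json

# Sketch: re-supplying the sentry_auth roles on every update so the
# document does not "disappear". Because sentry_auth is typically not
# stored, these roles must come from somewhere other than a query result.
doc_id = "doc-42"
roles = ["analysts", "admins"]  # roles this document should stay visible to

update = [{
    "id": doc_id,
    "status": {"set": "reviewed"},    # the field we actually want to change
    "sentry_auth": {"set": roles},    # must be re-supplied with the update
}]

payload = json.dumps(update)
# POST payload to /solr/<collection>/update with Content-Type application/json
print(payload)
```

If sentry_auth were omitted here, the update would still be accepted, but no role could see the resulting document.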
Also note that you can update a document that you do not have document-level access to, provided you have write access to the collection as a whole and you have the ID of the document. This means that users can (deliberately or accidentally) overwrite or delete documents that they cannot see. This is a design choice, made so that users cannot find out whether a particular document ID exists when they do not have document-level access to it.
See the Cloudera documentation:
http://blog.cloudera.com/blog/2014/07/new-in-cdh-5-1-document-level-security-for-cloudera-search/
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/search_sentry_doc_level.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/search_sentry.html

Related

Is it possible to specify the copyField source as a field from a different collection in Solr?

I am having an issue with partial updates in Solr. Since there are some non-stored fields in my collection, the values in those non-stored fields are gone after a partial update. So, is it possible to use copyField to copy the original content for the non-stored field from a different collection?
No. copyField instructions are invoked when a document is submitted for indexing, so it's not clear how that would even work semantically. In practice, what a copyField instruction does is duplicate a field's value when the document arrives at the server and copy it into fields with other names. That model doesn't hold if a different collection is involved - would it be invoked when documents are submitted to the other collection? (And if so, what would happen to the fields local to the actual collection?)
Set the fields to stored if you want to use partial updates with fields that can't support in-place updates (which have very particular requirements: being non-stored, non-indexed, single-valued numeric docValues fields).
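One way to follow that advice is the Schema API's replace-field command; a minimal sketch, where the field name and type are hypothetical:

```python
import json

# Sketch: marking a field as stored via the Schema API so its value
# survives atomic (partial) updates. "description" and "text_general"
# are example names; substitute your own field and type.
replace_field = {
    "replace-field": {
        "name": "description",
        "type": "text_general",
        "indexed": True,
        "stored": True,   # stored fields are preserved across atomic updates
    }
}

payload = json.dumps(replace_field)
# POST payload to /solr/<collection>/schema, then re-index existing documents
print(payload)
```

Note that changing a field to stored only affects documents indexed after the change; existing documents need to be re-indexed.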

Howto: Reload entities in Solr

Let's say you have a Solr core with multiple entities in your document. In my case, the reason for that is that the index is fed by SQL queries and I don't want to deal with multiple cores. So, in case you add or change one entity's configuration, you eventually have to re-index the whole shop, which can be time-consuming.
There is a way to delete and re-index one single entity, and this is how it works:
Prerequisite: your index entries have to have a field that reflects the entity name. You could either do that via a constant in your SQL statement or by using the TemplateTransformer:
<field column="entityName" name="entityName" template="yourNameForTheEntity"/>
You can use this name to remove all entity items from the index via using the Solr admin UI. Go to documents,
request-Handler: /update
Document-Type: JSON
Document(s): {"delete": {"query": "entityName:yourNameForTheEntity"}}
After submitting the document, all related items are gone and you can see that via running a query on the query page:
{!term f=entityName}yourNameForTheEntity
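The delete step can also be scripted instead of going through the admin UI; a minimal sketch that builds the same JSON body ("yourNameForTheEntity" is the placeholder from above):

```python
import json

# Sketch: delete-by-query body for removing all index entries belonging
# to one entity. The entity name is a placeholder.
entity = "yourNameForTheEntity"

delete_cmd = {"delete": {"query": f"entityName:{entity}"}}

payload = json.dumps(delete_cmd)
# POST payload to /solr/<core>/update?commit=true with
# Content-Type application/json
print(payload)
```
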
Then go to the Dataimport page to re-load your entity. Uncheck the Clean checkbox, select your entity, and Execute.
After the indexing is complete, you can go back to the query page and check the result.
That's it.
Have fun,
Christian

Reloading External file field with server up

I am trying to implement an external file field in order to change ranking values in Solr.
I've defined a field and field type in the schema and, in solrconfig.xml, below the <query> tag, created the external file and added the reload listeners as described in the ref guide.
After server start-up, I'm able to sort the documents based on that previously created field; however, when I change the values while the server is up and make a new search query, I'm not able to see the updated rank list (nor the updated rank scores).
I also tried adding a reload request handler as suggested in another post and tried a force commit (http://HOST:PORT/solr/update?commit=true), but it says:
DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
DirectUpdateHandler2 end_commit_flush
Any suggestions?
Using ExternalFileField for scoring is really not that useful any more, since Solr and Lucene now support in-place updates for values that use docValues.
You can then use those fields directly from your document for scoring, and you can update them without having to update the whole document. That way you don't have to reload anything externally, and your caches can be managed automagically by Solr.
There are three conditions a field has to meet for in-place updates (that being said, atomic updates can also be used, but that requires all your fields to be set as stored):
An atomic update operation is performed using this approach only when the fields to be updated meet these three conditions:
- they are non-indexed (indexed="false"), non-stored (stored="false"), single-valued (multiValued="false") numeric docValues (docValues="true") fields;
- the _version_ field is also a non-indexed, non-stored, single-valued docValues field; and,
- copy targets of updated fields, if any, are also non-indexed, non-stored, single-valued numeric docValues fields.
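A minimal sketch of such an in-place update, assuming a hypothetical numeric docValues field named "rank" that meets the conditions above:

```python
import json

# Sketch: in-place update of a numeric docValues field used for scoring.
# The document id and field name "rank" are example values; the field
# must be non-indexed, non-stored, single-valued with docValues.
update = [{
    "id": "doc-1",
    "rank": {"set": 4.2},   # only the docValues are rewritten, not the doc
}]

payload = json.dumps(update)
# POST payload to /solr/<collection>/update?commit=true
print(payload)
```

After the commit, queries that sort or boost on the field see the new value without any external file reload.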

How do I query revision history in ArangoDB?

I see the _rev in every document created in ArangoDB, but I have yet to see any information about using those revisions to access the change history for a document. More specifically, how do I query the revision history for a specific document to see the previous versions or even a specific version in time?
My understanding is that the revision (_rev) attribute is just there as a marker so you can know when a document was updated. You can't change it directly, but every time you UPDATE a document in a collection, the _rev value is updated.
To store historical values you would need to implement a process to archive the old values of a document when they get updated.
The _rev attribute can be very helpful when scanning a document to see whether any values were changed. Rather than doing a deep compare between a document and what you expect to see, you can just compare the _rev attribute with the value you expect. If the database returns a different _rev value than the one you were checking for, your code can respond to the document having changed, however required.
Remember, you have access to the old version of a document when you execute an UPDATE or UPSERT command (the doco) and you could choose to return the OLD document contents to push off to an archive location, or process as you wish. The updated document will receive a new _rev value after that update.
The OLD value does not persist after the return of the UPDATE or UPSERT command, so you'll have to archive it right away or the older document will be lost.
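A minimal sketch of that archiving approach, using AQL's RETURN OLD through ArangoDB's HTTP cursor API; the collection name "docs", the attribute, and the bind values are all hypothetical:

```python
import json

# Sketch: an AQL UPDATE that returns the pre-update document via OLD,
# so it can be archived before it is lost. Sent as a cursor-API request
# body; collection and attribute names are examples.
query = """
UPDATE { _key: @key } WITH { status: @status } IN docs
RETURN OLD
"""

body = {
    "query": query,
    "bindVars": {"key": "12345", "status": "archived"},
}

payload = json.dumps(body)
# POST payload to http://HOST:PORT/_db/<db>/_api/cursor; each result row
# is the OLD document, ready to write to an archive collection.
print(payload)
```
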

How do you update data in Solr 4?

We need to update the index of Solr 4 but are getting some unexpected results. We run a C# program that uses SolrNet to do an AddRange(). In this process, we're adding new documents and also trying to update existing ones.
We're noticing that some records' fields get updated with the latest data, while others still show the old information. Should we be using the information indicated in the documentation?
The documentation indicates we can set an update="set|add|inc" on the field. If we'd like the existing record to be updated, should we use set? Also, when we delete a field, to have it removed, do we need to shut down Solr and restart? Or set null="true"?
Can you point us to some good information on doing updates to Solr data? Thank you.
The documentation reference that you list describes the parameters for Atomic Updates in Solr 4, which is currently not supported in SolrNet - see issue 199 for more details.
Until this support has been added to SolrNet, your only option for updating documents in the index is to resend the entire document (object in C#) with the required updated/deleted fields set appropriately. Internally, Solr will re-add the document to the index with the updated fields.
Also, when you are adding/updating documents in the index, these changes will not be visible to queries against the index until a commit has been issued. I would recommend using the CommitWithin option of AddParameters to let Solr handle this internally; this is described in detail in the SolrWiki - CommitWithin.
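For reference, this is the shape of the update request Solr ends up receiving when the whole document is resent with commitWithin; a minimal sketch (not SolrNet code - the field names and values are hypothetical):

```python
import json

# Sketch: re-adding a complete document, since atomic updates are not
# available. Every field must be present; fields omitted here would be
# dropped when Solr re-adds the document.
doc = {
    "id": "product-7",
    "name": "Updated name",
    "price": 9.99,
}

body = {"add": {"doc": doc, "commitWithin": 5000}}  # commit within 5 seconds

payload = json.dumps(body)
# POST payload to /solr/<collection>/update with Content-Type application/json
print(payload)
```

The commitWithin value bounds how long the change can stay invisible to queries, while letting Solr batch commits internally.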
