Execute bulk partial update on Solr

We want to update a field on many documents at once, without passing each document's ID. We may need to update about 1.5 M documents in one go. An atomic update, as described in https://solr.apache.org/guide/8_2/updating-parts-of-documents.html#atomic-updates, looks like a good way to do this on our Solr cluster (8.4.1). As required by https://solr.apache.org/guide/8_2/updating-parts-of-documents.html#field-storage, we have checked that all fields in the managed-schema file have stored="true". We have the following questions:
Can we use an atomic update to change a field on more than one document without passing the ID? If not, please suggest another way to do this.
Do we need to stop traffic (reads and writes) while updating a field with an atomic update?
Is an atomic update feasible for about 1.5 M documents in one go? Can we execute the add, set, etc. operations specified in https://solr.apache.org/guide/8_2/updating-parts-of-documents.html#atomic-updates for the same key-value pair across about 1.5 M documents?
Please let us know if you have any questions.
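One possible approach, sketched under assumptions: atomic updates are addressed by the uniqueKey, so the IDs of the matching documents would first be collected with a query (not shown here) and then updated in batches rather than in a single request. The helper below only builds the JSON bodies. The function name, the field name `status`, the value `archived`, and the batch size are all hypothetical.

```python
import json
from typing import Any, Iterable, Iterator, List


def atomic_set_batches(
    ids: Iterable[str], field: str, value: Any, batch_size: int = 1000
) -> Iterator[str]:
    """Yield JSON bodies, each holding atomic "set" updates for up to batch_size ids."""
    batch: List[dict] = []
    for doc_id in ids:
        # One atomic-update document: the uniqueKey plus a {"set": value} modifier.
        batch.append({"id": doc_id, field: {"set": value}})
        if len(batch) == batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield json.dumps(batch)


if __name__ == "__main__":
    # Hypothetical ids; in practice these would come from a Solr query.
    for body in atomic_set_batches(["a", "b", "c"], "status", "archived", batch_size=2):
        print(body)
```

Each body would then be sent as an HTTP POST with Content-Type application/json to the collection's /update handler, with commits issued sparingly rather than per batch.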

Related

Solr atomic update with stored copyField destination

I would like to use Solr atomic updates in combination with some stored copyField destination fields, which is not a recommended combination - so I wish to understand the risks.
The Solr documentation for Atomic Updates says (my emphasis):
The core functionality of atomically updating a document requires that
all fields in your schema must be configured as stored (stored="true")
or docValues (docValues="true") except for fields which are
<copyField/> destinations, which must be configured as stored="false".
Atomic updates are applied to the document represented by the existing
stored field values. All data in copyField destinations fields must
originate from ONLY copyField sources.
However, I have some copyField destinations that I would like to set stored=true so that highlighting works correctly for them (see this question, for example).
I need atomic updates so that an (unrelated) field can be modified by another process, without losing data indexed by my process.
The documentation warns that:
If destinations are configured as stored, then Solr will
attempt to index both the current value of the field as well as an
additional copy from any source fields. If such fields contain some
information that comes from the indexing program and some information
that comes from copyField, then the information which originally came
from the indexing program will be lost when an atomic update is made.
But what does that mean? Can someone give an example that demonstrates this information-loss problem?
I am unsure what is meant by "some information that comes from the indexing program and some information that comes from copyField", in concrete terms.
Is it safe to make one copyField destination stored, whilst atomically updating other fields, or vice versa? I have tried this out via the Solr Admin console, and have not been able to demonstrate any issues, but would like to be clear on what circumstances would trigger the problem.
It means that the copyField destination will have an additional value added from the source field, which effectively turns it into a multi-valued field. If the destination isn't defined as multi-valued, the field is then no longer of the right type, and no further updates can be made to the document until you reindex everything. I'm currently struggling with this exact issue: we need the copyField values to come back as part of the response, which means the field must be stored, but storing it breaks the structure of the document as soon as we do an atomic update on a different field.
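The duplication described above can be illustrated with a toy model. This is not Solr code: it only assumes that an atomic update rebuilds the document from its stored values and then runs the copyField rules again. The field names `title`, `text_all`, and `views` are hypothetical.

```python
from typing import Dict, List

Doc = Dict[str, List[str]]

# Hypothetical schema rule: copy "title" into "text_all".
COPY_FIELDS = {"title": "text_all"}


def index(doc: Doc) -> Doc:
    """Apply the copyField rules to an incoming document."""
    out = {name: list(values) for name, values in doc.items()}
    for source, dest in COPY_FIELDS.items():
        if source in out:
            out.setdefault(dest, []).extend(out[source])
    return out


def stored_values(indexed: Doc, destinations_stored: bool) -> Doc:
    """Return what can be read back: destinations only if they are stored."""
    if destinations_stored:
        return {name: list(values) for name, values in indexed.items()}
    dests = set(COPY_FIELDS.values())
    return {name: list(values) for name, values in indexed.items() if name not in dests}


def atomic_update(stored: Doc, changes: Doc) -> Doc:
    """Model an atomic update: rebuild from stored values, apply changes, re-index."""
    merged = {name: list(values) for name, values in stored.items()}
    merged.update(changes)
    return index(merged)


if __name__ == "__main__":
    indexed = index({"id": ["1"], "title": ["Solr"]})
    print(atomic_update(stored_values(indexed, False), {"views": ["10"]})["text_all"])
    print(atomic_update(stored_values(indexed, True), {"views": ["10"]})["text_all"])
```

In this model, the destination keeps one value when it is not stored and gains a second copy when it is stored, even though the update touched an unrelated field.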

Updating documents with SOLR

I have a community website with 20,000 members who use a search form every day to find other members. The results are sorted by most recently connected members. I'd like to use Solr for my search (right now it's MySQL), but I'd first like to know whether it's good practice to update the document of every member who logs in, in order to change their login date and time. There would be around 20,000 document updates a day; I don't really know whether that is too much updating and could hurt performance. Thank you for your help.
20k updates/day is not unreasonable at all for Solr.
OTOH, for very frequently updated fields (imagine a user who logs in several times a day, where you might want to record every login), you can use an external file field (ExternalFileField) to keep that value outside the index (in a text file) and still use it for sorting in Solr.
Generally, Solr is not used for this purpose; using your database is still better.
However, if you want to use Solr, you handle it much as you would in the database, i.e. every user document should have a unique field, id for example. When the user logs in, you can send an update for the last_login_date field of that user's document, addressed by its id. You can learn more about partial updates from this link.
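As a rough sketch of the partial update described above, the helper below builds the JSON body for one user's login. The function name is hypothetical; `last_login_date` follows the answer and is assumed to be a Solr date field, which expects UTC timestamps in the form YYYY-MM-DDThh:mm:ssZ.

```python
import json
from datetime import datetime, timezone


def last_login_update(user_id: str, when: datetime) -> str:
    """Build an atomic "set" update for one user's last_login_date.

    `when` should be timezone-aware; it is converted to UTC before formatting.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps([{"id": user_id, "last_login_date": {"set": stamp}}])


if __name__ == "__main__":
    # Hypothetical user id; the body would be POSTed to the /update handler.
    print(last_login_update("user-42", datetime.now(timezone.utc)))
```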

Prevent Duplication in Solr using UpdateRequestProcessor chain

We are using Solr to store items that have been received and ingested through another service.
I am currently looking into a task to avoid duplicate items being created with the same id.
I am not an expert in Solr and am trying to pick up the task from someone who has left the company. The last suggestion about how to prevent duplication mentioned that it should be possible using a combination of defining a unique id on the id field and using an UpdateRequestProcessor chain. I don't know enough about the UpdateRequestProcessor chain to understand the approach they had in mind. I know the ultimate goal was that when an item is sent to Solr with the same id as an existing item, an update is performed rather than a create.
I have looked at Solr documentation about the UpdateRequestProcessor chain. Without more background information, those resources have not helped that much so far. I think I would benefit from Solr experts to help me get started or pointing me in the right direction.
You don't need to get a URP involved. It is much simpler than that. If your doc's id (defined in schema.xml as <uniqueKey>id</uniqueKey>) is already a unique id, then you don't need to do anything else. Indexing a doc with the same id twice will update it the second time (a delete and a new insert under the hood).
If your uniqueKey is not the unique id, then just rework the schema (and the app using Solr if it needs to), so they match.
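The overwrite behaviour described in this answer can be pictured with a small in-memory model. It is only a toy stand-in for an index keyed by uniqueKey, not Solr itself, and the class and field names are hypothetical.

```python
from typing import Any, Dict


class UniqueKeyIndex:
    """Toy index in which the "id" field plays the role of the uniqueKey."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def add(self, doc: Dict[str, Any]) -> None:
        # Delete any existing document with this id, then insert the new one.
        self._docs.pop(doc["id"], None)
        self._docs[doc["id"]] = dict(doc)

    def get(self, doc_id: str) -> Dict[str, Any]:
        return dict(self._docs[doc_id])

    def count(self) -> int:
        return len(self._docs)


if __name__ == "__main__":
    idx = UniqueKeyIndex()
    idx.add({"id": "item-1", "name": "first", "colour": "red"})
    idx.add({"id": "item-1", "name": "second"})
    print(idx.count(), idx.get("item-1"))
```

Note that in this model the second add replaces the whole document, so a field that is not sent again (here `colour`) is gone afterwards.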

solr 4's atomic updates insight

Are atomic updates significantly faster than fetching data from a source, building a whole new document, and indexing it? Basically, I would like to know how exactly Solr's atomic updates work.
It actually reindexes the whole document, see http://wiki.apache.org/solr/Atomic_Updates.
Atomic update could be faster because it does not involve fetching the current document from Solr first and then reposting the modified document. You can save on network time. Internally Solr will use the existing values for fields not specified in the atomic update (which is why you need to keep all values as stored).
Atomic updates also help you avoid conflicts, since you need not worry about somebody else changing the document before you post your modified version. This problem can also be dealt with using optimistic concurrency.
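The optimistic concurrency mentioned above can be sketched with a toy model. In Solr the check is requested by sending a `_version_` value with the document, and a mismatch is rejected with an HTTP 409 conflict. The class below only imitates that rule in memory, with small made-up version numbers and hypothetical names.

```python
from typing import Any, Dict


class VersionConflict(Exception):
    """Raised when the expected version does not match the stored one."""


class VersionedIndex:
    """Toy index that checks an expected version before applying an update."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}
        self._last_version = 1  # real Solr versions are large numbers

    def get(self, doc_id: str) -> Dict[str, Any]:
        return dict(self._docs[doc_id])

    def update(self, doc_id: str, changes: Dict[str, Any], expected_version: int = 0) -> int:
        """Apply changes; an expected_version above 1 must match exactly, 0 skips the check."""
        current = self._docs.get(doc_id)
        if expected_version > 1:
            if current is None or current["_version_"] != expected_version:
                raise VersionConflict(doc_id)
        self._last_version += 1
        merged = dict(current) if current else {"id": doc_id}
        merged.update(changes)
        merged["_version_"] = self._last_version
        self._docs[doc_id] = merged
        return self._last_version


if __name__ == "__main__":
    idx = VersionedIndex()
    v1 = idx.update("1", {"price": 10})
    idx.update("1", {"price": 12}, expected_version=v1)  # succeeds
    try:
        idx.update("1", {"price": 15}, expected_version=v1)  # stale version
    except VersionConflict:
        print("conflict: re-read the document and retry")
```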

What's the difference between findAndModify and update in MongoDB?

I'm a little bit confused by the findAndModify method in MongoDB. What's the advantage of it over the update method? To me, it seems that it just returns the item first and then updates it. But why do I need to return the item first? I read MongoDB: The Definitive Guide, and it says that findAndModify is handy for manipulating queues and performing other operations that need get-and-set style atomicity. But I didn't understand how it achieves this. Can somebody explain this to me?
If you fetch an item and then update it, there may be an update by another thread between those two steps. If you update an item first and then fetch it, there may be another update in-between and you will get back a different item than what you updated.
Doing it "atomically" means you are guaranteed that you are getting back the exact same item you are updating - i.e. no other operation can happen in between.
findAndModify returns the document, update does not.
If I understood Dwight Merriman (one of the original authors of MongoDB) correctly, using update to modify a single document, i.e. {"multi": false}, is also atomic. Currently, it should also be faster than doing the equivalent update using findAndModify.
From the MongoDB docs (emphasis added):
By default, both operations modify a single document. However, the update() method with its multi option can modify more than one document.
If multiple documents match the update criteria, for findAndModify(), you can specify a sort to provide some measure of control on which document to update.
With the default behavior of the update() method, you cannot specify which single document to update when multiple documents match.
By default, findAndModify() method returns the pre-modified version of the document. To obtain the updated document, use the new option.
The update() method returns a WriteResult object that contains the status of the operation. To return the updated document, use the find() method. However, other updates may have modified the document between your update and the document retrieval. Also, if the update modified only a single document but multiple documents matched, you will need to use additional logic to identify the updated document.
Before MongoDB 3.2 you cannot specify a write concern to findAndModify() to override the default write concern whereas you can specify a write concern to the update() method since MongoDB 2.6.
When modifying a single document, both findAndModify() and the update() method atomically update the document.
One useful class of use cases is counters and similar cases. For example, take a look at this code (one of the MongoDB tests): find_and_modify4.js.
Thus, with findAndModify you increment the counter and get its incremented
value in one step. Compare: if you (A) perform this operation in two steps and
somebody else (B) does the same operation between your steps then A and B may
get the same last counter value instead of two different (just one example of possible issues).
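The counter case can be sketched with an in-memory stand-in. This is not the MongoDB driver: the lock below merely plays the role of the server applying the increment and returning the result as one step, and all names are hypothetical. With a real driver such as PyMongo, the comparable call would be find_one_and_update with an $inc update and the option to return the document after modification.

```python
import threading
from typing import List


class ToyCounter:
    """In-memory stand-in for a counter document with a get-and-set style operation."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def find_and_modify_inc(self) -> int:
        """Increment the counter and return the new value as one indivisible step."""
        with self._lock:
            self._value += 1
            return self._value


def take_tickets(counter: ToyCounter, workers: int = 4, per_worker: int = 100) -> List[int]:
    """Have several threads draw values concurrently and collect everything they saw."""
    seen: List[int] = []
    seen_lock = threading.Lock()

    def work() -> None:
        for _ in range(per_worker):
            value = counter.find_and_modify_inc()
            with seen_lock:
                seen.append(value)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen


if __name__ == "__main__":
    values = take_tickets(ToyCounter())
    print(len(values), len(set(values)))  # every caller received a distinct value
```

Because the increment and the read happen together, no two callers can receive the same value, which is the guarantee a separate update followed by a find cannot give.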
This is an old question but an important one and the other answers just led me to more questions until I realized: The two methods are quite similar and in many cases you could use either.
Both findAndModify and update perform atomic changes within a single request, such as incrementing a counter; in fact, the <query> and <update> parameters are largely identical.
With both, the atomic change takes place directly on a document matching the query when the server finds it, i.e. there is an internal write lock on that document for the fraction of a millisecond in which the server confirms the query is valid and applies the update.
There is no system-level write lock or semaphore which a user can acquire. Full stop. MongoDB deliberately doesn't make it easy to check out a document then change it then write it back while somehow preventing others from changing that document in the meantime. (While a developer might think they want that, it's often an anti-pattern in terms of scalability and concurrency ... as a simple example imagine a client acquires the write lock then is killed while holding it. If you really want a write lock, you can make one in the documents and use atomic changes to compare-and-set it, and then determine your own recovery process to deal with abandoned locks, etc. But go with caution if you go that way.)
From what I can tell there are two main ways the methods differ:
If you want a copy of the document as it was when your update was made: only findAndModify allows this, returning either the original (default) or the new record after the update, as mentioned. With update you only get a WriteResult, not the document, and of course reading the document immediately before or after doesn't guard you against another process changing the record between your read and your update.
If there are potentially multiple matching documents: findAndModify only changes one, and allows you to customize the sort to indicate which one should be changed. update can change all of them with multi, although it defaults to just one, but it does not let you say which one.
Thus what HungryCoder says makes sense: update is more efficient where you can live with its restrictions (e.g. you don't need to read the document, or of course if you are changing multiple records). But for many atomic updates you do want the document, and findAndModify is necessary there.
We used findAndModify() for counter operations (inc or dec) and other cases that mutate a single field. While migrating our application from Couchbase to MongoDB, I found that this API could replace code that called GetAndlock(), modified the content locally, called replace() to save, and then called Get() again to fetch the updated document. With MongoDB, I just used this single API, which returns the updated document.

Resources