Refresh index in Solr - solr

In case of integration tests, before a tests starts I put documents in Solr, and wait (with a sleep...) for Solr to index.
With Elasticsearch, I know it is possible to refresh an index.
Is it possible to do the same with Solr ? And how to proceed ?

I suppose the reason you want to refresh the index is because you want near real time search. Essentially you want the search to reflect the added document instantaneously.
In solr usually its controlled by softCommits or hardCommit with openSearcher=true.
Read here more about this
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
The gist is this
Hard commits are about durability, soft commits are about visibility
Now if i understand you are doing this all for testing purpose, so you can not probably change the softcommit time for your collection(As this will have other implications).
I think however you can force solr to commit the changes while indexing as follows:
http://localhost:8983/solr/my_collection/update?softCommit=true
So adding softCommit=true will cause an explicit commit to happen. you can use the above after you add bunch of docs so that make all of them appear in the index together or alternatively you can add softCommit=true in each request for indexing doc.
However every time you do a soft commit it invalidates all top level caches.(Read more about all this in the above link.)
Note: Please be aware however the usual recommendation is to not call commits externally.

Related

Solr - Get more next "cursorMark" indices to allow a real pagination

i'm developing an application, and the database is managed by Solr v8.1.
I have the necessity of create a pagination system, and i have read that cursors are advised for this type of operations.
The problem is: if i want create a pagination system, that will show to the end-user more than 1 next page, how can i do this?
Normally solr will return only 1 nextCursor index, but what about next 2/3/4 or more pages? Is this possible? Is possible have the same behaviour for previous cursors?
Checking the documentation, seems a continue fetch using the next cursor is mandatory, but i don't think that this is a smart solution.
Thanks
Sounds like what you want is regular pagination if those are important features. CursorMarks are (very) useful for certain use cases, but might not give you any additional performance in your case.
You can however use cursorMarks, but a cursorMark won't tell you how far into a result set you've come (or how many rows are left - just how many rows there are in total. You can still keep track of this manually in your UI). The cursorMark only tells Solr "this is the last entry I showed, so start returning values from here..". This is useful for deep pagination across a cluster with many nodes, as it greatly reduces the required number of results to fetch from each node in the cluster.
If you decide to use a cursorMark, keep track of the current offset, the page size and the page number in your URL. You won't be able to let people skip directly to page X, but you can at least show how many results that remain (this is the same strategy as applied by Gmail).

Solr near real time search: impact of reindexing frequently the same documents

We want to use SolR in a Near Real Time scenario. Say for example we want to filter / rank our results by number of views.
SolR SoftCommit was made for this use case but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know each update, even partial are implemented as a full delete and full addition of the document in lucene.
It seems to me having many times the same docs in the Tlog is inefficient and might also be problematic during the merge process (is the doc marked n times as deleted and added?)
Any advice / good practice?
Two things you could use for supporting this scenario:
In place updates: only that field is udpated, not the whole doc. Check out the conditions you need to be able to use them.
ExternalFileFieldType you keep the values in an external file
if the scenario is critical, I would test both in reald world conditions if possible, and asses.

solr 4's atomic updates insight

Are atomic updates significantly faster than fetching data from a source and then making a whole new document and indexing it. Basically I would like to know how exactly solr's atomic updates work?
It actually reindexes the whole document, see http://wiki.apache.org/solr/Atomic_Updates.
Atomic update could be faster because it does not involve fetching the current document from Solr first and then reposting the modified document. You can save on network time. Internally Solr will use the existing values for fields not specified in the atomic update (which is why you need to keep all values as stored).
Atomic update also helps you avoid conflicts since you need not worry if somebody else changes the document by the time you post your modified document. This problem could be dealt by using optimistic concurrency also.

What's the difference between findAndModify and update in MongoDB?

I'm a little bit confused by the findAndModify method in MongoDB. What's the advantage of it over the update method? For me, it seems that it just returns the item first and then updates it. But why do I need to return the item first? I read the MongoDB: the definitive guide and it says that it is handy for manipulating queues and performing other operations that need get-and-set style atomicity. But I didn't understand how it achieves this. Can somebody explain this to me?
If you fetch an item and then update it, there may be an update by another thread between those two steps. If you update an item first and then fetch it, there may be another update in-between and you will get back a different item than what you updated.
Doing it "atomically" means you are guaranteed that you are getting back the exact same item you are updating - i.e. no other operation can happen in between.
findAndModify returns the document, update does not.
If I understood Dwight Merriman (one of the original authors of mongoDB) correctly, using update to modify a single document i.e.("multi":false} is also atomic. Currently, it should also be faster than doing the equivalent update using findAndModify.
From the MongoDB docs (emphasis added):
By default, both operations modify a single document. However, the update() method with its multi option can modify more than one document.
If multiple documents match the update criteria, for findAndModify(), you can specify a sort to provide some measure of control on which document to update.
With the default behavior of the update() method, you cannot specify which single document to update when multiple documents match.
By default, findAndModify() method returns the pre-modified version of the document. To obtain the updated document, use the new option.
The update() method returns a WriteResult object that contains the status of the operation. To return the updated document, use the find() method. However, other updates may have modified the document between your update and the document retrieval. Also, if the update modified only a single document but multiple documents matched, you will need to use additional logic to identify the updated document.
Before MongoDB 3.2 you cannot specify a write concern to findAndModify() to override the default write concern whereas you can specify a write concern to the update() method since MongoDB 2.6.
When modifying a single document, both findAndModify() and the update() method atomically update the document.
One useful class of use cases is counters and similar cases. For example, take a look at this code (one of the MongoDB tests):
find_and_modify4.js.
Thus, with findAndModify you increment the counter and get its incremented
value in one step. Compare: if you (A) perform this operation in two steps and
somebody else (B) does the same operation between your steps then A and B may
get the same last counter value instead of two different (just one example of possible issues).
This is an old question but an important one and the other answers just led me to more questions until I realized: The two methods are quite similar and in many cases you could use either.
Both findAndModify and update perform atomic changes within a single request, such as incrementing a counter; in fact the <query> and <update> parameters are largely identical
With both, the atomic change takes place directly on a document matching the query when the server finds it, ie an internal write lock on that document for the fraction of a millisecond that the server confirms the query is valid and applies the update
There is no system-level write lock or semaphore which a user can acquire. Full stop. MongoDB deliberately doesn't make it easy to check out a document then change it then write it back while somehow preventing others from changing that document in the meantime. (While a developer might think they want that, it's often an anti-pattern in terms of scalability and concurrency ... as a simple example imagine a client acquires the write lock then is killed while holding it. If you really want a write lock, you can make one in the documents and use atomic changes to compare-and-set it, and then determine your own recovery process to deal with abandoned locks, etc. But go with caution if you go that way.)
From what I can tell there are two main ways the methods differ:
If you want a copy of the document when your update was made: only findAndModify allows this, returning either the original (default) or new record after the update, as mentioned; with update you only get a WriteResult, not the document, and of course reading the document immediately before or after doesn't guard you against another process also changing the record in between your read and update
If there are potentially multiple matching documents: findAndModify only changes one, and allows you customize the sort to indicate which one should be changed; update can change all with multi although it defaults to just one, but does not let you say which one
Thus it makes sense what HungryCoder says, that update is more efficient where you can live with its restrictions (eg you don't need to read the document; or of course if you are changing multiple records). But for many atomic updates you do want the document, and findAndModify is necessary there.
We used findAndModify() for Counter operations (inc or dec) and other single fields mutate cases. Migrating our application from Couchbase to MongoDB, I found this API to replace the code which does GetAndlock(), modify the content locally, replace() to save and Get() again to fetch the updated document back. With mongoDB, I just used this single API which returns the updated document.

How to reindex all docs in Solr data

I am going to change some field types in the schema, so seems it must re-index all the docs in current Solr index data with this kind of change.
The question is about how to "re-index" all the docs?
One solution that I can think of is to "query" all docs through the search interface and dump a large file in XML or JSON, then convert it to the input XML format for Solr, and load it back to Solr again to make the schema change happen.
Is there some better way can do this more efficiently? Thanks for your suggestion.
First of all, dumping the results of a query may not give you the original data if you have fields that are indexed and not stored. In general, it is best to keep a copy of the input to SOLR in a form that you can easily use to rebuild indexes from scratch if you need to. In that case, just run a delete query by posting <delete><query>*:*</query></delete> then <commit/> and then <optimize/>. After that your index is empty and you can add new documents that use the new schema.
But you may be able to get away with just running <optimize/> after you restart SOLR with the new schema file. It would be good to have a backup where you can test that it works for your configuration.
There is a tool called Luke that can be used to browse and export Lucene indexes. I have never tried it myself, but it might be able to help you export your data so that you can reimport it.
The idea of dumping all the results of a query could give you incomplete or invalid data since you might not surface all of the data within your index.
While the idea of keeping a copy of your index in a form in which you can re-insert it would work well in a situation where the data doesn't change, it becomes more complicated when you've added a new field to the schema. In such a situation, you'll need to collect all the data from the source, format the data to match the new schema and then insert it.
If the number of documents in the Solr is big and you need to keep Solr server available for querying, the indexing job could be started to re-add/re-index documents in the background.
It is helpful to introduce a new field to keep the lastindexed timestamp per each document, so in the case of any indexing/re-indexing issues, it will be possible to identify waiting for reindexing documents.
To improve the latency of querying, it is possible to play with configurations parameters to keep the caches after every commit.
There is a PHP script that does exactly this: fetch and reinsert all your Solr documents, reindexing them.
For optimizing, call from command line:
curl http://<solr_host>:<port>/solr/<core_name>/update -F stream.body=' <optimize />'

Resources