How to reindex all docs in Solr

I am going to change some field types in the schema, so it seems I must re-index all the docs currently in the Solr index for the change to take effect.
The question is: how do I "re-index" all the docs?
One solution I can think of is to query all docs through the search interface, dump them to a large XML or JSON file, convert that into Solr's input XML format, and load it back into Solr so the schema change takes effect.
Is there a better way to do this more efficiently? Thanks for your suggestions.

First of all, dumping the results of a query may not give you the original data if you have fields that are indexed but not stored. In general, it is best to keep a copy of the input to SOLR in a form that you can easily use to rebuild indexes from scratch if you need to. In that case, just run a delete query by posting <delete><query>*:*</query></delete>, then <commit/> and then <optimize/>. After that your index is empty and you can add new documents that use the new schema.
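For reference, here is a minimal sketch of that delete-and-rebuild step using curl against the XML update handler (host and core name are placeholders):

# Delete every document in the index (placeholder host/core)
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'
# Commit the deletes, then optimize to merge away the deleted segments
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'

After that you can re-post your source documents so they are indexed under the new schema.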
But you may be able to get away with just running <optimize/> after you restart SOLR with the new schema file. It would be good to have a backup copy on which you can test that this works for your configuration.
There is a tool called Luke that can be used to browse and export Lucene indexes. I have never tried it myself, but it might be able to help you export your data so that you can reimport it.

The idea of dumping all the results of a query could give you incomplete or invalid data, since you might not be able to surface all of the data within your index.
While keeping a copy of your input in a form you can re-insert works well when the data doesn't change, it becomes more complicated once you've added a new field to the schema. In that situation, you'll need to collect all the data from the source, format it to match the new schema, and then insert it.

If the number of documents in Solr is large and you need to keep the Solr server available for querying, the indexing job can run in the background to re-add/re-index documents.
It is helpful to introduce a new field that keeps a last-indexed timestamp per document, so that if any indexing/re-indexing issues occur, you can identify which documents are still waiting to be reindexed (a query sketch follows below).
To improve query latency while this runs, you can tune the cache configuration (e.g. autowarming) so that caches stay warm after every commit.
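As a hedged sketch of that bookkeeping (the field name last_indexed_dt and the cutoff date are invented for illustration), a range query can list documents whose last-indexed timestamp predates the schema change:

# Find documents that still need reindexing (placeholder host, core, field and date)
curl 'http://localhost:8983/solr/mycore/select?q=last_indexed_dt:[*%20TO%202015-06-01T00:00:00Z]&fl=id&rows=100'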

There is a PHP script that does exactly this: fetch and reinsert all your Solr documents, reindexing them.
To optimize, call from the command line:
curl http://<solr_host>:<port>/solr/<core_name>/update -F stream.body='<optimize/>'

Related

How can we implement archiving in SOLR?

I am new to Apache SOLR and I want to implement archiving in SOLR, since my data is growing day by day. I am not sure whether SOLR supports data archiving.
If anybody has any suggestions on this, please share them.
This question is pretty general so it's a bit hard to give a cut-and-dried answer, but if one thinks about archiving for a moment, there are two parts to it:
1. Removing old data
2. Storing the old data in an alternate location.
The first part is fairly easy in Solr, so long as you can identify a query that will select the "old" documents. For example, if you have a field called 'indexed_date' that records when you sent the data to Solr and you want to delete everything before Jan 1, 2014, you might do this:
curl http://localhost:8983/solr/update --data '<delete><query>indexed_date:[* TO 2014-01-01T00:00:00]</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
The second part requires more thought. The first question is: why would you want to move the data out of Solr to some other location? The answer more or less has to be that you think you might need it again. But ask yourself what the use case for that is, and how you might serve that use case. Are you planning on putting the data back into Solr at some later point if you want it? Is Solr the only place where this data was stored, and do you need it for record keeping/audit only?
You will have to determine the second half of "archiving" based on your needs, but here are some things to think about:
The data behind fields in Solr that are stored="false" is already lost; you cannot completely reconstruct the documents that went into creating the index.
Fields for which stored="true" can be retrieved in XML/JSON/CSV with a regular query, and then written out to the long-term storage of your choice.
Many systems use Solr as an index into the primary sources rather than as a primary source itself. In that case there may be no need to archive the data at all; simply remove the data that is too old to be relevant in the search results, but of course make sure that your business team understands and agrees with this strategy before you do it! :)
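For the export half, a hedged sketch (host, core, field names and cutoff date are placeholders) that pulls the stored fields of the old documents out as CSV before you delete them:

# Export stored fields of documents older than the cutoff for long-term storage
curl 'http://localhost:8983/solr/mycore/select?q=indexed_date:[*%20TO%202014-01-01T00:00:00Z]&fl=id,title,body&wt=csv&rows=100000' > archive-pre-2014.csv

For very large result sets you would page through with cursorMark rather than using a huge rows value.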
EDIT: I happened to look back at this and when I re-read it I realized I left something out and there's a new development.
What I Left Out
The above delete-by-query strategy has the drawback that deleted documents remain in the index (just marked as deleted), potentially wasting as much as 50% of your space (or more if you've run "optimize" in the past!). Here's a good article by Erick Erickson about deleting and the space consequences:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
New Development
If time is the criterion for deletion and you followed the best practice I mentioned above about not having Solr be the single source of truth (i.e. Solr is just an index into a primary source, not the data store), then you may very well want to use the new Time Routed Aliases feature, which maintains a set of temporally bounded collections and deletes the oldest ones. The great thing about deleting a whole collection rather than deleting by query is that there is no merging to do: the segments for that index disappear as a whole, so there are no deleted documents hanging around wasting space.
http://lucene.apache.org/solr/guide/7_4/time-routed-aliases.html
Self Promotion Disclaimer: Along with David Smiley, I helped write this feature
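As a rough, hedged sketch of what setting one up looks like (alias, config and field names here are invented; check the reference guide linked above for the exact parameters), a time routed alias is created through the Collections API:

# Create a time routed alias that rolls over to a new collection every month (placeholder names)
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=logs&router.name=time&router.field=timestamp_dt&router.start=NOW/MONTH&router.interval=%2B1MONTH&create-collection.collectionConfigName=_default&create-collection.numShards=1'

Documents are then indexed against the alias, and old collections are dropped whole instead of being deleted by query.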

Partial Update of documents

We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past, but they weren't the right fit for all our use cases.
2. Alternatively, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once; it never goes back to modify existing ones, so it only marks documents as deleted and removes them for real when a merge happens.
Search servers on top of lucene try to work around this problem by exposing a single endpoint that's able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
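For completeness, with Solr 4 and all fields stored, an atomic (partial) update request looks roughly like this (host, core, id and field names are placeholders):

# Set one field and add a value to a multivalued field without resending the whole document
curl 'http://localhost:8983/solr/collection1/update?commit=true' -H 'Content-Type: application/json' -d '[{"id":"doc1","price":{"set":19.95},"tags":{"add":"onsale"}}]'

Under the hood Solr still rebuilds the full document from its stored fields, deletes the old one and re-indexes the result.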
Elasticsearch works around it storing the source documents by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is by the way one of the features that make elasticsearch similar to NoSQL databases. The elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, if you disable it you of course lose this great feature.
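For illustration, a hedged sketch of the first option via the Elasticsearch Update API (index, type, id and field are placeholders):

# Merge a partial document into an existing one; Elasticsearch fetches _source, applies the change and re-indexes
curl -XPOST 'http://localhost:9200/myindex/mytype/1/_update' -d '{"doc":{"author":"A J Crown"}}'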

Does SchemaChange need Reindex

If a change has been made to the Solr schema configuration, do we always need to rebuild the index?
E.g. if I have changed the field type of a field from general_text to string, do I need to rebuild the whole index, or is there any shortcut?
It depends on what you change.
If you change a field's name or type, that definitely calls for a re-index, because the data has to be analyzed according to the new analysis pipeline. The same holds true for adding or deleting a field.
However, there is one scenario where a re-index is not required: when you change only the query-time analysis of a field type (see the sketch below). Since all the applicable analysis changes happen at query time, merely restarting the Solr server is required.
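For instance, a hedged schema.xml sketch (type name and filters chosen for illustration) where only the query-time analyzer differs, e.g. adding query-time synonyms, does not require a re-index:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- a query-time-only change: adding synonyms here does not touch what is already indexed -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>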
Changes to the schema generally require a re-index of the collection.
You would need to re-index the content because the index-time analysis performed for the field's type would be different.
If you don't re-index, the query-time analysis performed for the field would differ from what was indexed, and no matches would be found.
Also helpful: How_can_I_rebuild_my_index_from_scratch_if_I_change_my_schema

Can Solr be used to match a document against keywords without storing the document?

I'm not entirely sure on the vocabulary, but what I'd like to do is send a document (or just a string really) and a bunch of keywords to a Solr server (using Solrnet), and have a return that tells me if the document is a match for the keywords or not, without having the document being stored or indexed to the server.
Is this possible, and if so, how do I do it?
If not, any suggestions of a better way? The idea is to check if a document is a match before storing it. Could it work to store it first with just a soft commit, and if it is not a match delete it again? How would this affect the index?
Index a document - send it to Solr to be tokenized and analyzed and the resulting strings stored
Store a document - send it to Solr to be stored as-is, without any modifications
So if you want a document to be searchable you need to index it first.
If you want a document (fields) to be retrievable in its original form, you need to store a document.
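To make the distinction concrete, a hedged schema.xml sketch (field names invented) of the two cases:

<!-- indexed but not stored: searchable, but the original text cannot be retrieved from Solr -->
<field name="content" type="text_general" indexed="true" stored="false"/>
<!-- stored but not indexed: retrievable as-is, but not searchable -->
<field name="raw_payload" type="string" indexed="false" stored="true"/>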
What exactly are you trying to accomplish? Avoid duplicate documents? Can you expand a little bit on your case...

Index file content and custom metadata separately with Solr 3.3

I am doing a POC on content/text search using Solr 3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, users can change the custom metadata of the documents. However, once a document is added to the index, its content cannot be updated. When the user updates the custom metadata, the document's index entry has to be updated to reflect the metadata changes in search.
But during the index update, even though the content of the file has not changed, it is indexed again, which causes delays in the metadata update.
So I wanted to check if there is a way to avoid re-indexing the content and update just the metadata?
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in index1, and documentId and custom metadata in another index? In that case, how can I query these two different indexes and return a combined result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two Solr indexes are considered two databases, not two tables, if you had DB-like joins in mind. I just answered this on How to start and Stop SOLR from A user created windows service.
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata AND year:1984
Note the type:xx.
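For example, a hedged sketch of feeding both documents through the JSON update handler (host, application path and field values are placeholders, reusing the "watson" example above):

# Post the contents document and the metadata document in one request
curl 'http://localhost:8080/app/update/json?commit=true' -H 'Content-Type: application/json' -d '[{"id":"docs_contents_watson","type":"contents","text":"text of the file"},{"id":"docs_metadata_watson","type":"metadata","author":"A J Crown","year":1984}]'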
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
We did try this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Compress and serialize the object, and then assign it to one more field in your schema. Make that field a binary field.
So when you want to update one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed it to Lucene.
Never forget to store the XML that contains the text extracted by Tika (used for search/indexing) as one of the fields inside the SolrInputDocument.
The only negative: your index size will grow a little bit, but you will get what you want without re-feeding the data.
