If a change has been made in the Solr schema configuration, do we always need to rebuild the index?
E.g. if I have changed the field type of a field from general_text to string, do I need to rebuild the whole index, or is there any shortcut?
It depends on what you change.
If you change a field's name or type, that definitely calls for a reindex, since the data has to be analyzed according to the newly applicable analysis pipeline. The same holds true for adding or deleting a field.
However, there is a rare scenario where a reindex is not required: when you change only the query-time analysis of a field type. Since all the applicable changes happen during query-time analysis, a mere restart of the Solr server (or a core reload) is enough.
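For example, assuming a standalone Solr with a core named mycore on localhost:8983 (both names are made up for illustration; SolrCloud would use the Collections API RELOAD action instead), a reload that picks up a query-time-only analyzer change looks like:
curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycore'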
Changes in the schema would require a reindex of the collection.
You would need to reindex the content, as the analysis done at indexing time for the field's type would be different.
If you don't reindex, the query-time analysis performed for the field would differ from what was indexed, and no matches would be found.
Also helpful: How_can_I_rebuild_my_index_from_scratch_if_I_change_my_schema
When is it safe to update the Solr schema and keep the existing indexes?
I am upgrading Solr to version 7.2 now, and some type definitions in my old schema generate warnings in the log like:
Solr loaded a deprecated plugin/analysis class [solr.CurrencyField]. Please consult documentation how to replace it accordingly.
Is it safe to update this type definition to the new solr.CurrencyFieldType and keep my existing indexes:
When the type is not used in the schema for document properties.
When the type is used in the schema for document properties.
Generally, what schema change will definitely require a total reindex of the documents?
If the field isn't being used, you can do anything you like with it - the schema is Solr's way of enforcing validation and exposing certain low-level Lucene settings for field configuration. If you've never indexed any content using the field, then you can update the field definition (or maybe better, remove it if you're not using it) without reindexing.
However, if you change the definition of an existing field to a different type (for example, when the int type changed from being a TrieInt to a Point field), the general rule is that you'll have to reindex to avoid getting random, weird, untraceable issues.
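If you do need to change an existing field's definition, a managed schema lets you do it through the Schema API before reindexing. A sketch, assuming a managed schema, a core named mycore, and a hypothetical field popularity being moved to the point-based pint type:
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycore/schema' \
  --data '{"replace-field": {"name": "popularity", "type": "pint", "stored": true}}'
The API call only changes the definition; the documents already in the index still have to be reindexed afterwards.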
For TextFields, if you're not changing the field type - i.e. the field is still of the same type, but you're changing the analysis or tokenization for the field - you might not have to reindex. If the change is only to the query part of the analysis chain, no reindexing is needed. If the change is to the indexing part (or both), it depends on what the change is: the existing tokens stored in the index won't change, so if you have indexed content without lowercasing it, and then add, for example, a lowercase filter for querying, you won't get a match for any existing tokens that contain uppercase. In that case you'll have to reindex to make your collection work properly again.
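To make that lowercasing example concrete, here is a sketch of a hypothetical fieldType in schema.xml where the filter was added only to the query-time chain:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <!-- index-time chain: tokens are stored as-is, including uppercase -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <!-- query-time chain: queries are lowercased, so they miss uppercase tokens -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Queries against this field are lowercased, but the tokens already in the index keep their original case, so existing uppercase tokens stop matching until you reindex.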
How can I copy data from a string field to an integer field in Solr without reindexing? The data volume is very high.
I have a field named brvc which is a string. I would like to use it for sorting in Solr, but due to the string field type this is not working properly.
I would like to make a new field like this, but how do I copy all the data from brvc to brvc_new? The data volume is very high.
You can use Atomic Updates to update documents, including adding the new field to the document.
However, the only way to update a document is to remove it and reindex it. Functions that update documents in Lucene are just a convenient wrapper around the process of removing the specified documents and then adding new ones. If only some portion of the index needs to be modified, then running updates may make sense. Keep in mind that all fields must be stored (or be copyFields of stored fields); otherwise their contents cannot be retrieved from the existing index and will be lost.
If you want to update every document with the new field, though, reindexing the whole thing is likely your best bet.
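If you do go the atomic-update route, a minimal sketch (assuming a core named mycore, that all fields are stored, and a hypothetical document id doc1) is to page through the existing values and then send a "set" for the new field:
# 1. fetch ids and existing brvc values in batches
curl 'http://localhost:8983/solr/mycore/select?q=*:*&fl=id,brvc&rows=1000&wt=json'
# 2. for each returned document, atomically set the new integer field
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycore/update?commit=true' \
  --data '[{"id": "doc1", "brvc_new": {"set": 42}}]'
In practice you would batch many documents per update request and commit once at the end rather than per request.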
We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this, but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past, but they weren't the right fit for all our use cases.
2. On the other hand, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents when they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: it never goes back to modify existing ones; it only marks documents as deleted and deletes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that is able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source documents by default, in a special field called _source. That's exactly the document you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you do, you of course lose this great feature.
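As a sketch against a recent Elasticsearch version (the index name myindex, document id 1, and fields are made up for illustration), the two variants look like:
# 1. partial document, merged into the existing _source
curl -X POST 'http://localhost:9200/myindex/_update/1' \
  -H 'Content-Type: application/json' \
  --data '{"doc": {"status": "updated"}}'
# 2. script executed against the existing _source
curl -X POST 'http://localhost:9200/myindex/_update/1' \
  -H 'Content-Type: application/json' \
  --data '{"script": {"source": "ctx._source.counter += 1"}}'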
In the documentation for Sunspot, it says:
If you make a change to the object's "schema" (code in the searchable block), you must reindex all objects so the changes are reflected in Solr
What happens if this procedure isn't followed?
Specifically, I have a fairly large index on Websolr, and if I just add a boolean field to it without reindexing, what will happen?
I'd like to be able to filter by true values of the boolean field, but I'll never need to filter by false or nil values. Will this work, or must this admonition to reindex always be obeyed?
In your case, if you add the field and do not reindex the data, it will still work.
However, the existing documents will not have a value for the field; only newly inserted documents will.
You can surely filter the documents based on the values, and the existing documents will simply have a nil value for the field.
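At the raw Solr level, a filter query on the true value therefore simply skips the older documents, which is what you want here. A sketch, assuming a core named mycore and a hypothetical boolean field featured_b:
curl 'http://localhost:8983/solr/mycore/select?q=*:*&fq=featured_b:true'
Documents with no value for featured_b are excluded by the filter, so no reindex is needed for this particular use case.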
Usually it depends on what you change.
You would not need a reindex if you change only the query-time analysis of a field type; a simple restart or core reload would work for you.
Other schema changes would require a reindex of the collection if you want the field populated for all documents.
If you change a field type, you would need to reindex the content, as the analysis done at indexing time for that type would be different.
If you don't reindex, the query-time analysis performed for the field would differ from what was indexed, and no matches would be found.
I am going to change some field types in the schema, so it seems all the docs in the current Solr index must be re-indexed for this kind of change.
The question is: how do I "re-index" all the docs?
One solution I can think of is to "query" all docs through the search interface and dump them to a large XML or JSON file, then convert that to Solr's input XML format and load it back into Solr so the schema change takes effect.
Is there a better way to do this more efficiently? Thanks for your suggestions.
First of all, dumping the results of a query may not give you the original data if you have fields that are indexed but not stored. In general, it is best to keep a copy of the input to SOLR in a form that you can easily use to rebuild indexes from scratch if you need to. In that case, just run a delete query by posting <delete><query>*:*</query></delete>, then <commit/> and then <optimize/>. After that your index is empty and you can add new documents that use the new schema.
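Over HTTP, that sequence looks like the following sketch (assuming a core named mycore on localhost:8983):
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'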
But you may be able to get away with just running <optimize/> after you restart SOLR with the new schema file. It would be good to have a backup where you can test that it works for your configuration.
There is a tool called Luke that can be used to browse and export Lucene indexes. I have never tried it myself, but it might be able to help you export your data so that you can reimport it.
The idea of dumping all the results of a query could give you incomplete or invalid data since you might not surface all of the data within your index.
While the idea of keeping a copy of your index in a form in which you can re-insert it would work well in a situation where the data doesn't change, it becomes more complicated when you've added a new field to the schema. In such a situation, you'll need to collect all the data from the source, format the data to match the new schema and then insert it.
If the number of documents in Solr is big and you need to keep the Solr server available for querying, the indexing job could be started in the background to re-add/re-index documents.
It is helpful to introduce a new field to keep a lastindexed timestamp per document, so in the case of any indexing/re-indexing issues it will be possible to identify the documents still waiting for reindexing.
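For example, after the schema change you could page through the documents whose timestamp predates it (a sketch with the hypothetical lastindexed field, a made-up core name, and a made-up cutoff; curl -G with --data-urlencode keeps the range query readable):
curl -G 'http://localhost:8983/solr/mycore/select' \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq=lastindexed:[* TO 2020-01-01T00:00:00Z]' \
  --data-urlencode 'fl=id'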
To improve query latency, it is possible to play with configuration parameters to keep the caches warm after every commit.
There is a PHP script that does exactly this: fetch and reinsert all your Solr documents, reindexing them.
For optimizing, call from the command line:
curl http://<solr_host>:<port>/solr/<core_name>/update -F stream.body='<optimize/>'