I would like to use Solr atomic updates in combination with some stored copyField destination fields, which is not a recommended combination - so I wish to understand the risks.
The Solr documentation for Atomic Updates says (my emphasis):
The core functionality of atomically updating a document requires that
all fields in your schema must be configured as stored (stored="true")
or docValues (docValues="true") except for fields which are
<copyField/> destinations, which must be configured as stored="false".
Atomic updates are applied to the document represented by the existing
stored field values. All data in copyField destinations fields must
originate from ONLY copyField sources.
However, I have some copyField destinations that I would like to set stored=true so that highlighting works correctly for them (see this question, for example).
I need atomic updates so that an (unrelated) field can be modified by another process, without losing data indexed by my process.
The documentation warns that:
If destinations are configured as stored, then Solr will
attempt to index both the current value of the field as well as an
additional copy from any source fields. If such fields contain some
information that comes from the indexing program and some information
that comes from copyField, then the information which originally came
from the indexing program will be lost when an atomic update is made.
But what does that mean? Can someone give an example that demonstrates this information-loss problem?
I am unsure what is meant by "some information that comes from the indexing program and some information that comes from copyField", in concrete terms.
Is it safe to make one copyField destination stored, whilst atomically updating other fields, or vice versa? I have tried this out via the Solr Admin console, and have not been able to demonstrate any issues, but would like to be clear on what circumstances would trigger the problem.
It means that the copyField destination will have an additional value added to it from the source field, effectively turning it into a multi-valued field. If the destination isn't defined as multiValued, the document no longer matches the field type and no further atomic updates can be made to it until you reindex everything. I'm currently struggling with this exact issue: we need the values to come back as part of the response for the copyField destination, which means it needs to be stored, but making it stored breaks the structure of the document whenever we do an atomic update on a different field.
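Here is a minimal sketch of that failure mode, assuming a local core in which a stored field title is copied into a title_copy field that is also stored (the core, field and value names are all made up):

```python
# Sketch against a local Solr core (core, field and value names are made up).
# Assumed schema: a stored field "title", a copyField rule title -> title_copy,
# title_copy itself declared stored="true", and an int field "popularity".
import requests

UPDATE = "http://localhost:8983/solr/mycore/update?commit=true"
GET = "http://localhost:8983/solr/mycore/get"

# 1) Normal indexing: the copyField fires once, title_copy holds one value.
requests.post(UPDATE, json=[{"id": "1", "title": "red leather sofa", "popularity": 1}])

# 2) Atomic update of an unrelated field. Solr rebuilds the document from its
#    stored fields, so the stored title_copy value is kept AND the copyField
#    fires again from title, giving title_copy a second value.
requests.post(UPDATE, json=[{"id": "1", "popularity": {"set": 2}}])

# 3) Inspect the result: if title_copy is single-valued this update (or a later
#    one) typically fails with a "multiple values for a non-multiValued field"
#    error; if it is multiValued, the copies keep piling up on every atomic update.
doc = requests.get(GET, params={"id": "1"}).json()["doc"]
print(doc.get("title_copy"))
```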
Related
I am new to Apache Solr and exploring some use cases that could be applicable to my application.
In one of the use cases, I have multiple MongoDB instances pushing data to Solr via mongo-connector. I am able to do so by running two instances of mongo-connector against two different Mongo instances, using the same Solr core.
My question is: how do I handle a situation where a field in the Mongo collection, say "startTime", is of Date type in one Mongo instance while the other treats it as a long? I want this field to be treated as a long in Solr. Does Solr provide any sort of automatic conversion, or will I have to write my own analyzer?
If you want both values to normalize to the same form, you should do that in an UpdateRequestProcessor (defined in solrconfig.xml). There are quite a number of them for various purposes, including date parsing. In fact, the schemaless mode is implemented by a chain of URPs, so that's an example you can review.
To process the different Mongo instances in different ways, you can just define separate update request handler endpoints (in solrconfig.xml again) and set up different processing for those. Use shared definitions to avoid duplicating what's common (using a processor reference as in the schemaless definition linked above).
It may be more useful to normalize to dates rather than away from them, as Solr allows more interesting searches that way, such as Date Math.
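To make the per-source routing concrete, here is a hedged sketch: the chain a request goes through can be chosen with the update.chain request parameter (or fixed per request handler), and the chain names below are hypothetical chains you would define in solrconfig.xml with processors such as ParseDateFieldUpdateProcessorFactory or ParseLongFieldUpdateProcessorFactory:

```python
# Illustration only (mongo-connector would normally do the posting): two URP
# chains, "mongo-a-chain" and "mongo-b-chain", are assumed to be defined in
# solrconfig.xml so that startTime ends up with the same type from both sources.
import requests

UPDATE = "http://localhost:8983/solr/mycore/update"

# Documents from the Mongo instance that stores startTime as a Date
requests.post(UPDATE,
              params={"update.chain": "mongo-a-chain", "commit": "true"},
              json=[{"id": "a1", "startTime": "2017-03-01T10:15:00Z"}])

# Documents from the Mongo instance that stores startTime as a long (epoch millis)
requests.post(UPDATE,
              params={"update.chain": "mongo-b-chain", "commit": "true"},
              json=[{"id": "b1", "startTime": 1488363300000}])
```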
When is it safe to update the Solr schema and keep the existing indexes?
I am upgrading Solr to version 7.2 now, and some type definitions in my old schema generate warnings in the log like:
Solr loaded a deprecated plugin/analysis class [solr.CurrencyField]. Please consult documentation how to replace it accordingly.
Is it safe to update this type definition to the new solr.CurrencyFieldType and keep my existing indexes:
When the type is not used in the schema for document properties.
When the type is used in the schema for document properties.
Generally, what schema change will definitely require a total reindex of the documents?
If the field isn't being used, you can do anything you like with it - the schema is Solr's way of enforcing validation and exposing certain low-level Lucene settings for field configuration. If you've never indexed any content using the field, then you can update the field definition (or maybe better, remove it if you're not using it) without reindexing.
However, if you change the definition of an existing field to a different type (for example, when the int type changed from being a TrieInt to a Point field), it's a general rule that you'll have to reindex to avoid getting random, weird, untraceable issues.
For TextFields, if you're not changing the field type itself but only the analysis or tokenization for the field, you might not have to reindex. If the change is only to the query part of the analysis chain, no reindexing is needed; if the change is to the indexing part (or both), it depends on what the change is. The existing tokens stored in the index won't change, so if you have indexed content without lowercasing it and then add, for example, a lowercase filter for querying, you won't get a match for any existing tokens that contain uppercase. In that case you'll have to reindex to make your collection work properly again.
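As a hedged illustration of that last point, here is a sketch using the Schema API to define a text type whose index and query analysis differ only by a query-side lowercase filter (the type name and core are made up); switching a field to something like this changes only query-time behaviour, but whether existing uppercase tokens still match is exactly the concern described above:

```python
# Hedged sketch: a TextField type whose query analyzer adds a lowercase filter
# that the index analyzer does not have. Defining or switching to such a type
# does not rewrite existing index data; it only changes how queries are analyzed.
import requests

SCHEMA_API = "http://localhost:8983/solr/mycore/schema"  # core name is made up

requests.post(SCHEMA_API, json={
    "add-field-type": {
        "name": "text_query_lowercase",
        "class": "solr.TextField",
        "indexAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"}
        },
        "queryAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [{"class": "solr.LowerCaseFilterFactory"}]
        }
    }
})
```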
I have different data sources that upload different documents to a Solr sink. If two data sources send a field with the same name but different data types (say integer and double), then indexing of the second document fails because the data type from the first document has already been added to the managed-schema.
All I need is for both fields to be indexed properly, as they used to be in Solr 4.x.
Since field names only arrive at runtime, please suggest a solution that would work for me. I suppose it needs a change in solrconfig.xml, but I could not find what is required.
How was your Solr configured to work in 4.x? You can still do it exactly the same way in Solr 6.
On the other hand, the schemaless feature defines the type mapping the first time it sees a field. It has no way to know what will come in the future. That's also why all auto-defined fields are multivalued.
However, if you want to deal with the specific problem of the integer mapping being too narrow, you can change the definition of the UpdateRequestProcessor chain that actually does the mapping. Just merge the mappings for integer/long/number into one final tdoubles type.
I have documents in Solr which consist of fields whose values come from different source systems. The reason I am doing this is that this document is what I want returned from the Solr search, including functionality like hit highlighting. As far as I know, if I use a join with multiple Solr documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field names so that the first part of each name is an 8-character hash representing the source system; that way the sources can share common field names after the unique source prefix, and I can easily clear out all fields that start with a given source prefix if needed.
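Roughly what I have in mind is something like the following sketch (the hash scheme, core name and field names are only illustrative):

```python
# Rough sketch of the naming/clearing scheme (everything here is illustrative).
import hashlib
import requests

SOLR = "http://localhost:8983/solr/mycore"

def source_prefix(system_name: str) -> str:
    # First 8 hex characters of a hash of the source system's name
    return hashlib.sha1(system_name.encode("utf-8")).hexdigest()[:8]

def field_name(system_name: str, name: str) -> str:
    # e.g. "3a5b9c01_colour_s" for system A's "colour_s" (prefix value made up)
    return f"{source_prefix(system_name)}_{name}"

def clear_source_fields(doc_id: str, system_name: str) -> None:
    # Fetch the current document, find the dynamic fields that came from this
    # source, and null them out with a single atomic update.
    prefix = source_prefix(system_name) + "_"
    doc = requests.get(f"{SOLR}/get", params={"id": doc_id}).json()["doc"]
    update = {"id": doc_id}
    for field in doc:
        if field.startswith(prefix):
            update[field] = {"set": None}  # atomic removal of the field
    if len(update) > 1:  # only send if there is actually something to clear
        requests.post(f"{SOLR}/update?commit=true", json=[update])
```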
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience, the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to drag in a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to perform well when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be kept specific to the indexing process. Usually the subsystems will be performant enough, at least if you do batch retrieval based on the documents that are being updated.
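A hedged sketch of that kind of indexing flow is below; the two fetch functions, the core name and the field layout are placeholders for whatever your subsystems and schema actually look like:

```python
# Sketch of a single "rebuild the whole document" path: both subsystems (or a
# local cache in front of them) are queried at indexing time and the complete
# document is posted to Solr, instead of patching individual fields.
import requests

UPDATE = "http://localhost:8983/solr/mycore/update?commit=true"

def fetch_from_system_a(entity_id: str) -> dict:
    ...  # placeholder: call system A, or a local cache kept in SQL/memcached

def fetch_from_system_b(entity_id: str) -> dict:
    ...  # placeholder: call system B, or a local cache kept in SQL/memcached

def reindex(entity_id: str, entity_type: str, name: str) -> None:
    doc = {"id": entity_id, "type": entity_type, "name": name}
    doc.update(fetch_from_system_a(entity_id) or {})
    doc.update(fetch_from_system_b(entity_id) or {})
    requests.post(UPDATE, json=[doc])

# Whenever either source changes an entity, call reindex() for the affected ids,
# whether from the normal indexing flow or from a utility script.
```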
We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that Solr can achieve this? We've tried Solr joins in the past, but they weren't the right fit for all our use cases.
2. On the other hand, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: it never goes back to modify existing segments, so it only marks documents as deleted and deletes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that's able to delete the old document and reindex the new one automatically, but there must be some way to retrieve the old document. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source document by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this feature.
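For illustration, here is a hedged sketch of the first variant against a local Elasticsearch node of that era (the index, type, id and field names are made up):

```python
# Partial update via the Elasticsearch Update API (pre-7.x style URL with an
# explicit mapping type). Only the fields under "doc" are sent; Elasticsearch
# merges them into the stored _source, deletes the old document internally and
# indexes the merged result.
import requests

ES = "http://localhost:9200"

requests.post(f"{ES}/myindex/mytype/1/_update",
              json={"doc": {"price": 42, "in_stock": True}})
```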