Custom version number for Solr schema - solr

Is it possible to store a custom version number somewhere in the Solr schema so that it could be retrieved by the client in order to verify that it is connected to a compatible Solr instance?
When I'm deploying a new version of the application to QA or production I need to be sure that all the data sources (Solr, RDBMS, etc) my app is connected to have been properly updated/migrated. So I want to perform some validation at the startup. It's easy with the database (e.g. storing current schema version in the VERSION table), but it's less obvious where to store the version information for the Solr schema.

The SystemInfoHandler will provide the version along with other information about the Solr instance. In later versions of Solr (3.x & 4.x), this is already enabled as part of the admin requestHandler.
You can access the information via http://localhost:8983/solr/admin/system from the example site distributed with Solr. Modify the url accordingly for your Solr configuration.
Note: If you are running an older version of Solr this can be enabled by adding the following line to the solrconfig.xml file.
<requestHandler name="/admin/system" class="solr.admin.SystemInfoHandler" />
Update:
The specific scenario of knowing when the schema has changed (i.e. versioning the schema) can be handled by updating the name attribute of the schema's root node every time the schema file is modified. This name value will then be available in the SystemInfoHandler response.
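A startup check along these lines might look like the following minimal Python sketch. It assumes the handler is queried with wt=json and that the schema name is reported under core.schema in the response; the URL, the expected name "myapp-schema-v12", and the function names are hypothetical, so verify the field layout against your own Solr version.

```python
import json
import urllib.request

# Hypothetical values: adjust the URL and expected name to your setup.
SYSTEM_INFO_URL = "http://localhost:8983/solr/admin/system?wt=json"
EXPECTED_SCHEMA_NAME = "myapp-schema-v12"


def schema_name_from_system_info(info):
    """Return the schema name from a parsed SystemInfoHandler response.

    Assumes the name is reported under info["core"]["schema"];
    returns None if that key is missing.
    """
    return info.get("core", {}).get("schema")


def is_compatible(info, expected=EXPECTED_SCHEMA_NAME):
    """True when the reported schema name equals the expected one."""
    return schema_name_from_system_info(info) == expected


def fetch_system_info(url=SYSTEM_INFO_URL, timeout=5):
    """Fetch and parse the system info JSON (performs a network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

At application startup you would call is_compatible(fetch_system_info()) and refuse to start if it returns False.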

Related

Managed-schema.xml file is overwritten when I populate Solr Managed Schema from Sitecore

In my solr managed-schema.xml file I added the following:
<copyField source="computedtitle_t" dest="computedtitlecopy_t" />
When I populate the schema from Sitecore, the managed-schema file is overwritten and so are my changes.
Is there a patch file on the Sitecore side where I can add this and to what section?
Yes, Sitecore manages the Solr schema for you through the populate-schema function in the Control Panel. This is done via the SchemaPopulateHelper. You can implement your own class, implementing the ISchemaPopulateHelper interface and register it in the config.
A while back, I wrote a generic implementation of this where you can put your entire managed schema in the Sitecore config instead. This also lets you take advantage of the Sitecore config file patch feature, so that your schema changes can go along with other Sitecore configs if needed.
You can read more about it here: https://mikael.com/2020/10/dealing-with-solr-managed-schema-through-sitecore-config-files/
Here is some more general information about how Sitecore works with Solr and managed schema: https://mikael.com/2018/01/working-with-content-search-and-solr-in-sitecore-9/
You can use the code here as a starting point: https://github.com/mikaelnet/sitecore-solr-config
Please note that there was a small interface change in Sitecore 9.3 (I think), so the sample code may need some changes for it to work. Also, make sure you start with a managed schema that is equal to the one that's provided with the Sitecore version you're using. There may be a few changes in the default schema between the versions.

Solr - Migrate Documents from one Collection to another existing one

I need to move all Solr Documents from one collection to another (already existing collection) - there are 500,000 documents.
I have tried the solr migrate but cannot get the routing key correct. I have tried:
curl 'http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=oldCollection&target.collection=newCollection&split.key=!'
I have solr 4.10.3 installed in a cloudera installation.
Copy your existing oldCollection and rename the copy to newCollection.
After that you may need to update some of its config files to match.
Alternatively, create a new collection using the API below:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
The question and the answer above are quite old. Starting with Solr 8.1, there is a feature built for exactly this purpose: the REINDEXCOLLECTION API, which reindexes documents from a source collection into a target collection and offers a lot of configurable options. Here is the link to the official doc: https://lucene.apache.org/solr/guide/8_1/collections-api.html#reindexcollection
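As a rough illustration, the request URL for such a reindex could be assembled like this in Python. The sketch only builds the URL and sends nothing; the helper name reindex_url is hypothetical, and it uses only the name (source) and target parameters. Check the linked documentation for the remaining options and for how an already existing target collection is handled.

```python
from urllib.parse import urlencode


def reindex_url(base_url, source, target):
    """Build a Collections API URL for action=REINDEXCOLLECTION.

    base_url is the Solr root, e.g. http://localhost:8983/solr.
    """
    params = {"action": "REINDEXCOLLECTION", "name": source, "target": target}
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


url = reindex_url("http://localhost:8983/solr", "oldCollection", "newCollection")
print(url)
```

The resulting URL can then be requested with curl, in the same way as the MIGRATE call in the question.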

Solr luceneMatchVersion syntax

I have Solr 4.10 with a collection whose solrconfig.xml has the following value for <luceneMatchVersion>:
<luceneMatchVersion>4.7</luceneMatchVersion>
Is this correct? I saw other examples that have values such as LUCENE_35. I also need to know how I could express my current Solr version in the LUCENE_xx form.
You should use:
<luceneMatchVersion>4.10.4</luceneMatchVersion>
I recommend checking your current Solr version; in my case it was 4.10.4.
If you are going to reindex, then both numbers should match. The only reason you might want them to differ is if you had an index created with, say, Lucene 4.7. Then you would have
<luceneMatchVersion>4.7</luceneMatchVersion>
Then, you upgrade lucene to 4.10.
Now, if among the changes between 4.7 and 4.10 there are things that work differently regarding analysis (the same sentence analysed in both versions produces different output), then you might want to keep the version number at 4.7. Otherwise some queries that contain affected terms might not work, as they were analysed at index time in a different way than at query time. You have to assess how critical that issue might be.
That is why the recommendation is to upgrade, change the setting to the current number, and reindex. This way you are sure to avoid any issue.
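To spot a mismatch before deciding whether to reindex, a small Python sketch like the one below can read the setting from solrconfig.xml and compare it with the installed version. The function names are hypothetical, and the sketch assumes the dotted form (such as 4.7); it does not handle the legacy LUCENE_xx constants or property placeholders of the ${...} kind.

```python
import xml.etree.ElementTree as ET


def lucene_match_version(solrconfig_xml):
    """Return the text of <luceneMatchVersion> from a solrconfig.xml string."""
    root = ET.fromstring(solrconfig_xml)
    node = root.find("luceneMatchVersion")  # direct child of the root element
    return node.text.strip() if node is not None and node.text else None


def major_minor(version):
    """Turn a dotted version such as '4.10.4' into the tuple (4, 10)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def matches_installed(solrconfig_xml, installed_version):
    """True when the configured and installed major.minor versions agree."""
    configured = lucene_match_version(solrconfig_xml)
    if configured is None:
        return False
    return major_minor(configured) == major_minor(installed_version)
```

A False result only flags that the two versions differ; whether to keep the old value or reindex is still the judgment call described above.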
If anyone is using Drupal, the Search API Solr (search_api_solr) module has config templates by version in /sites/all/modules/search_api_solr/solr-conf/.
The template README.md states the following:
The solr-conf-templates directory contains config-set templates for
different Solr versions.
These are templates and are not to be used as config-sets!
To get a functional config-set you need to generate it via the Drupal
admin UI or with drush solr-gsc. See README.md in the module
directory for details.
The module's README.md lists these instructions:
1. Make sure you have Apache Solr started and accessible (i.e. via port 8983). You can start it without having a core configured at
   this stage.
2. Visit Drupal configuration (/admin/config/search/search-api) and create a new Search API Server according to the search_api
   documentation using "Solr" as Backend and the connector that
   matches your setup. Input the correct core name (which you will
   create at step 4, below).
3. Download the config.zip from the server's details page or by using drush solr-gsc with proper options, for example for a server named
   "my_solr_server": drush solr-gsc my_solr_server config.zip 8.4.
4. Copy the config.zip to the Solr server and extract.
I generated a config file for 8.x, and it uses this:
<luceneMatchVersion>${solr.luceneMatchVersion:LUCENE_80}</luceneMatchVersion>

Update solr schema.xml in real time for Solr 4.10.1

I understand that Solr 5.0 provides a REST API for updating the schema in real time using curl. However, I could not do that with my earlier version, Solr 4.10.1.
Is this function available in the earlier version of Solr, and is the curl syntax the same as in Solr 5.0?
According to the Solr Wiki, it's possible to request the schema from Solr 4.2 onwards and to modify it starting from Solr 4.4:
In order to enable schema modifications via the Schema REST API, the
schema implementation must be declared as managed by Solr, that is,
not to be manually edited.
Further, the schema must be configured as mutable in order to make
modifications to it.
Both of these schema features (managed and mutable) are configured via
the <schemaFactory/> element in solrconfig.xml.
More information - https://wiki.apache.org/solr/SchemaRESTAPI
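As far as I recall, the 4.x syntax differs from 5.0: on 4.4 and later, fields are added by POSTing a JSON list to the schema/fields resource, whereas 5.0 added a single /schema endpoint that takes commands such as add-field. The Python sketch below only builds such a 4.x-style request and does not send it. The core URL, the field name, and the helper name add_field_request are hypothetical, and the endpoint shape should be verified against the wiki page above for your exact version.

```python
import json
import urllib.request


def add_field_request(core_url, name, field_type, stored=True):
    """Build (but do not send) a POST that adds one field to a managed schema.

    Assumes the 4.x-style resource <core_url>/schema/fields, which accepts
    a JSON list of field definitions.
    """
    body = json.dumps([{"name": name, "type": field_type, "stored": stored}])
    return urllib.request.Request(
        core_url.rstrip("/") + "/schema/fields",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core URL and field definition.
req = add_field_request("http://localhost:8983/solr/collection1", "newfield", "text_general")
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it; this only succeeds if the schema is both managed and mutable, as described in the quote above.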

Running Solr in read-only mode

I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip some components to make it read-only, it simply wasn't designed with security in mind. It's meant to be used as an internal service; you wouldn't expose an RDBMS either.
From the Solr Security wiki page:
First and foremost, Solr does not
concern itself with security either at
the document level or the
communication level. It is strongly
recommended that the application
server containing Solr be firewalled
such the only clients with access to
Solr are your own. A default/example
installation of Solr allows any client
with access to it to add, update, and
delete documents (and of course
search/read too), including access to
the Solr configuration and schema
files and the administrative user
interface.
Even ajax-solr, a Solr client for javascript meant to run in a browser, recommends talking to Solr through a proxy.
Take for example guardian.co.uk: it's well-known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what and how they want people to search for things.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
Replication is a nice way to set up a read-only instance while still being able to index. Just set up a master with restricted access and a slave that is read-only (by removing your XmlUpdateRequestHandler from the config). The slave replicates from the master but won't accept any indexing requests directly.
UPDATE
I just read that in Solr 1.4 you can disable a component. I just tried it on the /update requestHandler and I was no longer able to index.
