I am migrating from Lucene to Solr and have limited knowledge of Solr at this point. The implementation uses Sitecore 8.1 (Update 2) and Solr 4.10.0, per the compatibility table here.
First, the schema.xml was updated following Solution 1 here. After running "Generate the Solr Schema.xml file" in Sitecore's control panel, the schema.xml file contained a list of dynamicField elements corresponding to language codes. The initial assumption was that all the languages added to Sitecore would have been mapped, but this does not appear to be the case. The list seems to be more in line with what Solr supports in the base instance. Language codes can be added to schema.xml by hand, though that is tedious if the Sitecore instance has a large number of languages.
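If the manual additions do turn out to be necessary, the repetitive part can be scripted. Below is a minimal sketch that prints one dynamicField line per language code. The `*_t_<code>` naming follows the field names in the error messages; the `text_general` type and the `indexed`/`stored` attributes are assumptions and should be checked against the field types actually defined in the generated schema.xml.

```python
from xml.sax.saxutils import quoteattr


def dynamic_field_lines(lang_codes, field_type="text_general"):
    """Return one <dynamicField> line per distinct language code.

    field_type is an assumption: use a type that exists in your schema.xml.
    """
    lines = []
    for code in sorted(set(lang_codes)):
        name = "*_t_" + code  # e.g. 'nn' -> '*_t_nn'
        lines.append(
            '<dynamicField name=%s type=%s indexed="true" stored="true" />'
            % (quoteattr(name), quoteattr(field_type))
        )
    return lines


if __name__ == "__main__":
    for line in dynamic_field_lines(["nn", "zh"]):
        print(line)
```

The printed lines can then be pasted next to the existing dynamicField entries.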
The primary concern is how the languages are being mapped from Sitecore to Solr. Several of the language codes needed for dynamicField elements don't line up with Sitecore languages, or even with the lang value in the query string that the log shows in the error message. A couple of examples of the issue:
org.apache.solr.common.SolrException: ERROR: [doc=sitecore://master/{234456d1-1dcd-4b53-8b63-588d8b948a69}?lang=en-no&ver=1&ndx=sitecore_master_index] unknown field 'extension_t_nn'
org.apache.solr.common.SolrException: ERROR: [doc=sitecore://master/{ed3796b0-bb9f-44a4-801f-1c26ae7ca6c4}?lang=en-cn&ver=1&ndx=sitecore_master_index] unknown field 'height_t_zh'
It is unknown how en-no resolves to nn, or how en-cn resolves to zh. Understanding this would be ideal before simply adding these language codes to the schema.xml.
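To see the full set of mismatches before touching schema.xml, the Solr log can be scanned for these errors. The sketch below only assumes the message format shown in the two examples above; it does not explain why the mapping happens, it just collects which item language produced which field suffix.

```python
import re

# Matches the "?lang=..." part of the doc id and the rejected field name.
ERROR_RE = re.compile(
    r"\?lang=(?P<lang>[^&\]]+).*?unknown field '(?P<field>[^']+)'"
)


def unknown_field_suffixes(log_lines):
    """Map each item language in the log to the field suffixes Solr rejected."""
    result = {}
    for line in log_lines:
        match = ERROR_RE.search(line)
        if not match:
            continue
        # 'extension_t_nn' -> 'nn'
        suffix = match.group("field").rsplit("_", 1)[-1]
        result.setdefault(match.group("lang"), set()).add(suffix)
    return result
```

Feeding it the lines of the log file gives a dictionary keyed by the Sitecore language, which makes it easy to see every suffix that would need a dynamicField.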
Related
This is on Solr 7.1.0. I have a classic schema, with the proper line in solrconfig.xml:
<schemaFactory class="ClassicIndexSchemaFactory"/>
However I still get this line in the log:
ManagedIndexSchemaFactory The schema has been upgraded to managed, but the non-managed schema schema.xml is still loadable. PLEASE REMOVE THIS FILE.
And when I inspect that core's schema it's the generic schema, not the one defined in my schema.xml.
This sounds like the core you think is loading is not the one that is actually loading, because:
Upgraded schema should have been the same as the original one - you are seeing a different one
You don't see managed-schema in the directory you expect it to be
You keep getting the message
So, have a look at the overview page of that core and check whether the instance directory points to where you expect it to be.
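The same information is available from the CoreAdmin STATUS call, which can be handy if you want to check it from a script. This is a rough sketch: the base URL and core name are placeholders, and the `schema` key is read defensively in case your version does not report it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def core_status_url(base_url, core):
    """Build the CoreAdmin STATUS URL for one core."""
    query = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return base_url.rstrip("/") + "/admin/cores?" + query


def core_locations(status_response, core):
    """Pick the instance directory and schema resource out of a STATUS response."""
    info = status_response["status"][core]
    return info.get("instanceDir"), info.get("schema")


def fetch_core_locations(base_url, core):
    """Call Solr and return (instanceDir, schema) for the given core."""
    with urlopen(core_status_url(base_url, core)) as resp:
        return core_locations(json.load(resp), core)
```

For example, `fetch_core_locations("http://localhost:8983/solr", "mycore")` should return the directory Solr actually loaded the core from, which you can compare with the directory you have been editing.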
I would like to add omitNorms=true to the title field.
It is wrongfully overboosting some of our titles.
However I don't know how the title field is indexed. What is its name - just dc.title?
I don't see anything about it in the schema.xml. What is the type of that field, and what analyzer or other processing is used for it? Is there any way to know?
Most metadata fields in DSpace are handled via dynamic fields. That's why you don't see each specified individually in the search core's schema.xml file.
I'm not sure where the boosting is happening (or even whether DSpace does any). I don't recall seeing any boost clauses when looking through the Solr log files. I do see some extraction parameters being set in SolrServiceImpl#writeDocument, where the document is being indexed. It looks like there is an extraction parameter for boosting individual fields; perhaps you can play with that to get what you'd like.
If you want to see the field type for any Solr field, the easiest option is probably the Schema Browser in the Solr admin user interface, e.g.
http://localhost:8080/solr/#/search/schema-browser?field=title (you may need to use an SSH tunnel or the like to access Solr running on a different host since the DSpace solr install is typically IP-limited to access from localhost).
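If you only have the schema.xml file at hand, you can also work out which declarations cover a given field name. The sketch below lists every explicit field and every dynamicField pattern that matches a name. It reports all matches rather than guessing which one Solr would pick, and it assumes patterns only use `*` as a wildcard.

```python
import fnmatch
import xml.etree.ElementTree as ET


def matching_fields(schema_xml, field_name):
    """Return (tag, name-or-pattern, type) for each declaration matching field_name."""
    root = ET.fromstring(schema_xml)
    matches = []
    # iter() also covers older schemas that wrap declarations in <fields>.
    for elem in root.iter():
        if elem.tag not in ("field", "dynamicField"):
            continue
        pattern = elem.get("name", "")
        if elem.tag == "field":
            hit = pattern == field_name
        else:
            hit = fnmatch.fnmatchcase(field_name, pattern)
        if hit:
            matches.append((elem.tag, pattern, elem.get("type")))
    return matches
```

Calling it with the contents of the search core's schema.xml and a name such as `title` or `dc.title` shows the candidate field types, which you can then look up in the same file to see their analyzers.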
In my Solr configuration files I have defined a DataImportHandler that fetches data from a MySQL database and also processes the contents of PDF files that are related to records in that database. The data import works fine.
I'm trying to detect the language of the text contained in the files during the data import phase. I have specified a TikaLanguageIdentifierUpdateProcessorFactory in my solrconfig.xml, as explained in https://wiki.apache.org/solr/LanguageDetection, and have defined the language fields in my document schema. Nevertheless, after I run the indexing from the Solr admin, I cannot see any language field on my documents.
In all the examples I have seen, language detection is done by posting a document to Solr with the post command. Is it possible to do language detection with a DataImportHandler?
Once you have defined the UpdateRequestProcessor chain, you need to actually specify it in the request handler (the DataImportHandler's in this case). You do that with the update.chain parameter.
Also, ensure that you include LogUpdate and RunUpdate processors, otherwise you are not even indexing at all.
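For a quick test before changing the handler definition, the parameter can, as far as I know, also be passed on the import request itself. A small sketch that builds such a URL is below; the chain name `langid`, the core name and the `/dataimport` handler path are hypothetical and must match your own solrconfig.xml.

```python
from urllib.parse import urlencode


def full_import_url(base_url, core, chain, handler="/dataimport"):
    """Build a DataImportHandler full-import URL that names an update chain."""
    params = urlencode({"command": "full-import", "update.chain": chain})
    return "%s/%s%s?%s" % (base_url.rstrip("/"), core, handler, params)


if __name__ == "__main__":
    print(full_import_url("http://localhost:8983/solr", "mycore", "langid"))
```

If the language fields show up after an import triggered this way, the chain itself is fine and only the handler configuration is missing the parameter.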
I have a small extra question that I believe is somewhat related to "Missing Id field in Solr index".
The issue is that search results contain duplicate items (ones that were edited); the number of duplicates depends on the edit count.
It seems that Sitecore doesn't remove the old item from the Solr index (no item versions).
Is this a Sitecore issue or some specific Solr behavior?
I see the following message in the Solr log; maybe it is connected:
WARN null IndexSchema no uniqueKey specified in schema.
There should be a <uniqueKey> tag in the schema.xml file of every Solr core:
<uniqueKey>_uniqueid</uniqueKey>
It should be directly under the root <schema> tag (not inside <fields> or any other tag).
If you follow the guide for enabling Solr with Sitecore, it should be included in your schema.xml automatically.
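A quick way to check this across several cores is to parse each schema.xml and look for the tag directly under the root. A minimal sketch:

```python
import xml.etree.ElementTree as ET


def unique_key(schema_xml):
    """Return the uniqueKey declared directly under <schema>, or None."""
    root = ET.fromstring(schema_xml)
    node = root.find("uniqueKey")  # find() only looks at direct children
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()
```

A result of `None` for a core would be consistent with the "no uniqueKey specified" warning. Without a unique key, Solr has no way to replace the earlier copy of a document, which would explain the duplicates.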
I just found that Solr 5 doesn't require a schema file to be predefined and that it generates the schema based on the indexing being performed. How does this work in the background?
Is it a good practice or not? Is there any way to disable it?
The schemaless feature has been in Solr since version 4.3. It may only have become stable more recently, as a concurrency issue with it was fixed in 4.10.
It relies on what Solr calls the managed schema. When you configure Solr to use the managed schema in this way, Solr uses a special UpdateRequestProcessor to intercept document indexing requests and guess field types.
Solr starts with your schema.xml file and creates a new file called, by default, managed-schema to store all the inferred schema information. This file is automatically overwritten by Solr as it detects changes to the schema.
You should then use the Schema API if you want to make changes to the Schema. See also the Schemaless Mode documentation.
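As an illustration of a Schema API change, the sketch below builds (but does not send) an add-field request. The collection name, field name and field type are placeholders; the type must already exist in your schema.

```python
import json
from urllib.request import Request


def add_field_request(base_url, collection, name, field_type, stored=True):
    """Build a POST request for the Schema API 'add-field' command."""
    payload = {"add-field": {"name": name, "type": field_type, "stored": stored}}
    return Request(
        url="%s/%s/schema" % (base_url.rstrip("/"), collection),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running Solr. Keeping the build step separate makes it easy to inspect the payload first.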
How to change Solr managed schema to classic schema
Stop Solr: $ bin/solr stop
Go to server/solr/mycore/conf, where "mycore" is the name of your core/collection.
Edit solrconfig.xml:
search for <schemaFactory class="ManagedIndexSchemaFactory"> and comment out the whole element
search for <schemaFactory class="ClassicIndexSchemaFactory"/> and uncomment it
search for the <initParams> element that refers to add-unknown-fields-to-the-schema and comment out the whole <initParams>...</initParams>
Rename managed-schema to schema.xml and you are done.
You can now start Solr again: $ bin/solr start, go to http://localhost:8983/solr/#/mycore/documents and check that Solr now refuses to index a document with a new field not yet specified in schema.xml.
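Before restarting, it can be worth confirming which schemaFactory is actually active in the edited file, since a leftover uncommented element is easy to miss. A small sketch, relying on the fact that the standard XML parser skips commented-out elements:

```python
import xml.etree.ElementTree as ET


def active_schema_factories(solrconfig_xml):
    """Return the class of every schemaFactory element that is not commented out."""
    root = ET.fromstring(solrconfig_xml)
    return [el.get("class") for el in root.iter("schemaFactory")]
```

After the steps above, the result for your solrconfig.xml should be a single entry, `ClassicIndexSchemaFactory`. An empty list means no factory is declared at all, in which case recent Solr versions default to the managed schema.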
Is it a good practice? When to use it?
It depends on what you want. If you want to enforce a specific document structure (e.g. to make sure that all docs are "well-formed" according to your definition), then you want to use the classical schema management.
If on the other hand you don't know upfront what the doc structure is then you might want to use the schema-less feature.
Limits
While it is called schema-less, there are limits to the kinds of structures that you can index. This is true both for Solr and Elasticsearch, by the way. For example, if you first index this doc:
{"name":"John Doe"}
then you will get an error if you try to index a doc like this next:
{"name": {
"first": "Daniel",
"second": "Dennett"
}
}
That is because in the first case the field name was of type string while in the second case it is an object.
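The idea can be illustrated with a toy model. This is not how Solr or Elasticsearch implement it. It only mimics the rule that the first value seen for a field fixes its kind, and that later values of a different kind are rejected.

```python
def kind_of(value):
    """Classify a JSON value roughly the way a schema guesser might."""
    return "object" if isinstance(value, dict) else type(value).__name__


def find_conflicts(docs):
    """Return (field, first_kind, new_kind) for each value that breaks the inferred kind."""
    inferred = {}
    conflicts = []
    for doc in docs:
        for field, value in doc.items():
            kind = kind_of(value)
            first = inferred.setdefault(field, kind)  # first sighting wins
            if kind != first:
                conflicts.append((field, first, kind))
    return conflicts


if __name__ == "__main__":
    docs = [
        {"name": "John Doe"},
        {"name": {"first": "Daniel", "second": "Dennett"}},
    ]
    print(find_conflicts(docs))
```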
If you would like to use indexing which goes beyond these limitations then you could use SIREn - it is an open source semi-structured information retrieval engine which is implemented as a plugin for both Solr and Elasticsearch. (Disclaimer: I worked for the company that develops SIREn)
This is the so-called schemaless mode in Solr. I don't know the internal details of how it's implemented.
bin/solr start -e schemaless
The snippet above starts Solr in schemaless mode. If you don't do that, Solr works as usual.
For more information on schemaless, take a look here - https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode