Stop Solr replacing colons with underscores in fieldnames - solr

I'm moving a system from Solr 1.4 to Solr 6.x (or possibly 5.x), and the field names all contain colons (e.g. "rdf:type"). I've converted all the configuration files to the Solr 6.x format using a schema.xml file, and I can see "rdf:type" in Solr's schema view.
These field names worked fine in 1.4, but now colons are automatically converted to underscores when indexing is attempted.
For instance using Solr's built in interface, if I try to submit a simple document like:
{'rdf:type': 'http://purl.org/ontology/bibo/Note'}
I get an error message saying:
ERROR: [doc=682e3f70-a4bc-4336-9f69-e7d620fe5fff] unknown field 'rdf_type'
Is it possible to "turn off" this feature? Will using colons cause problems with the newest versions of Solr?
(On a side note, making "rdf:type" a compulsory field and then not including it causes an error which reads: "missing required field: rdf:type", i.e. it displays the correct name)

This behaviour is not native to Solr itself; it is part of the default update processor chain that the bundled examples add to the configuration for schemaless mode (which is the default).
The reason is that Lucene uses : to separate field names from the values to be queried in those fields, so it's usually easier to keep : out of field names.
You can change this by removing the FieldNameMutatingUpdateProcessorFactory from the update chain, or by using your own schema (without the update processor chain).
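As a sketch, the relevant part of solrconfig.xml looks roughly like this (the chain name and pattern mirror the bundled data-driven example configs, and other processors in the chain are omitted; verify against your own file):

```xml
<!-- solrconfig.xml: schemaless update chain from the bundled examples
     (abbreviated). Removing or commenting out the
     FieldNameMutatingUpdateProcessorFactory stops field names being
     rewritten, so "rdf:type" stays "rdf:type". -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- This processor replaces every character outside [\w-.] (which
       includes the colon) with an underscore; delete it to keep colons. -->
  <processor class="solr.FieldNameMutatingUpdateProcessorFactory">
    <str name="pattern">[^\w-\.]</str>
    <str name="replacement">_</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```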

How can I view actually stored transformed Solr text field values?

When Solr returns a document, the field values match those that were passed to the Solr indexer.
However, especially for TextFields, Solr typically indexes a modified value to which (depending on the definition in schema.xml) various filters are applied, typically:
conversion to lower case
replacing of synonyms
removal of stopwords
application of stemming
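A minimal sketch of such a filter chain in schema.xml (the field type name and resource filenames here are common defaults, not taken from the question):

```xml
<!-- schema.xml: a text field type applying the four filters listed above -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- conversion to lower case -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- replacing of synonyms -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <!-- removal of stopwords -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <!-- application of stemming -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```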
One can see the result of the conversion for specific texts by using Solr Admin > Some core > Analysis. There is a tool called Luke, and the LukeRequestHandler, but with these it seems I can only view the values passed to Solr, not the transformed variant. One can also look at the index data on disk, but it seems to be stored in a binary format.
However, none of these seem to let me see the actual value as stored.
The reason for asking is that I've created a text field based on a certain filter chain which, according to Solr Admin > Analysis, transforms the text correctly. However, when searching for a specific word in the transformed text, it won't find it.

Solr 3.6.2 spellcheck multi-word phrase: how to get collations without ignored stopwords?

I'm having a problem with the Solr 3.6.2 default (field based) spellchecker configured with query time parameters
spellcheck.onlyMorePopular=true
spellcheck.count=5
spellcheck.collate=true
spellcheck.maxCollations=5
spellcheck.maxCollationTries=5
on a field type which has a solr.StopFilterFactory filter on its analyzers.
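For reference, those query-time parameters can equally be set as defaults on the request handler in solrconfig.xml (the handler name /select and the component wiring are assumptions; adjust to your config):

```xml
<!-- solrconfig.xml: the spellcheck parameters above as handler defaults -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollations">5</str>
    <str name="spellcheck.maxCollationTries">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```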
The suggestion phase works as intended:
the indexed field does not contain any stopword
no suggestion is provided for a given stopword
But the resulting collation always contains the ignored stopwords, which I don't want: I'd prefer a raw suggestion of combined terms over something which looks like a "sort of" natural language answer.
For instance, searching for "handfum of perries", I'd prefer "handful berry" over "handful of berry".
I don't think the stopwords excluded from spellchecking suggestions by the field's query analyzer are "marked" for preservation the way the official documentation describes for other query elements:
Note that the non-spellcheckable terms such as those for range
queries, prefix queries etc. are detected and excluded for
spellchecking. Such non-spellcheckable terms are preserved in the
collated output so that the original query can be run again, as is.
It seems two solutions would be:
either having a custom query converter so the stopwords are ignored right from the start (I'm not sure that's possible in 3.6.2),
or having a custom spellchecker that would not try to find any suggestion for a stopword (or would always suggest an "empty" string), without messing up the collation process.
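For the first option, a query converter is plugged in via solrconfig.xml; a sketch, where com.example.StopwordSkippingQueryConverter is a hypothetical class you would have to write yourself:

```xml
<!-- solrconfig.xml: replace the default SpellingQueryConverter with a
     custom one (class name is hypothetical) that drops stopword tokens
     before they reach the spellchecker and the collation step -->
<queryConverter name="queryConverter"
                class="com.example.StopwordSkippingQueryConverter"/>
```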
Am I missing something?
Regards

Solr indexing fails over media_black_point

Up front I want to say that I don't have much experience with Solr.
The problem we are facing: we only want to index the content of files and do not want to add dynamic fields. Is this possible, and if so, how?
Problem 2: if the answer to problem 1 is no, how would we exclude media_black_point and media_white_point from indexing?
Error code where Solr trips:
{"responseHeader":{"status":400,"QTime":149},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"incompatible dimension (2) and values (313/1000 329/1000). Only 0 values specified","code":400}}
Dynamic Fields and schemaless mode are both there to catch fields you did not declare explicitly. If neither are used, the assumption is that every field you send to Solr (including output from extract handler that generates a Solr document internally) needs to be explicitly mapped. This helps to avoid spelling errors and other unexpected edge-cases.
If you want to ignore all the fields you did not define explicitly, you can use dynamic field with stored/indexed/docValues all set to false. Solr ships with one example out of the box, you just need to uncomment it.
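A minimal sketch of that catch-all (this mirrors the commented-out "ignored" example in the shipped schema; check your own schema for the exact form):

```xml
<!-- schema.xml: silently swallow any field not declared explicitly -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" docValues="false" multiValued="true"/>
<dynamicField name="*" type="ignored" multiValued="true"/>
```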
The other option is to ignore specific fields. You can do that by defining a custom UpdateRequestProcessor chain (or individual URP in the latest Solr) and using IgnoreFieldUpdateProcessorFactory with your specific field name or a name pattern.
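A sketch of such a chain, using the two field names from the error message (the chain name here is arbitrary, and your existing chain may contain additional processors):

```xml
<!-- solrconfig.xml: drop the two offending Tika-generated fields
     before the document reaches the index -->
<updateRequestProcessorChain name="ignore-tika-color-fields" default="true">
  <processor class="solr.IgnoreFieldUpdateProcessorFactory">
    <str name="fieldName">media_black_point</str>
  </processor>
  <processor class="solr.IgnoreFieldUpdateProcessorFactory">
    <str name="fieldName">media_white_point</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```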

DataStax Enterprise: No search results from Solr

I’m using DataStax 3.2.7 and have 2 rows in Cassandra that show up in cqlsh.
I cannot find these records in Solr, though, even after reloading the core and fully reindexing.
Any suggestions?
I also see this in the log: Type mapping versions older than 2 are unsupported for CQL3 table linkcurrent_search.content_items, forcing version 2.
When you are using Dynamic Fields to query Maps in Cassandra, you must begin the key in your map with the prefixed map literal. In your case the prefixed map literals are:
score_calculated_
score_value_
score_velocity_
shared_on_
The error 'undefined field realtime' occurs because realtime is not prefixed by the prefix specified for that field in schema.xml.
An example of what one of your records would look like would be:
{'score_value_realtime': 18.432}
Do the same for all the map values.
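On the Solr side, such map columns are typically backed by dynamic fields in schema.xml; a sketch, where the float and string types are assumptions based on the sample value, not taken from your schema:

```xml
<!-- schema.xml: dynamic fields matching the prefixed map literals -->
<dynamicField name="score_value_*"      type="float"  indexed="true" stored="true"/>
<dynamicField name="score_calculated_*" type="float"  indexed="true" stored="true"/>
<dynamicField name="score_velocity_*"   type="float"  indexed="true" stored="true"/>
<dynamicField name="shared_on_*"        type="string" indexed="true" stored="true"/>
```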
For more details see this url:
http://www.datastax.com/documentation/datastax_enterprise/3.2/datastax_enterprise/srch/srchDynFlds.html

In Solr hierarchical facets, is there a way to use another character than «/» to separate nodes in the hierarchical facetting path field?

I need your help.
I'm working on a Typo3 website about mathematics, and we use:
A Solr server to provide the search engine.
A Typo3 Solr extension to provide the connection between our Typo3 CMS and our Solr server.
We have indexed objects that are organized in a tree, and we use this tree to provide a hierarchical facets presentation for search. For this, we generate and maintain programmatically a path string, that Solr uses.
But unfortunately we happen to have slashes «/» in some of our indexed objects' titles (for example those involving fractions), and that leads to unpredictable results when rendering the hierarchical facets based on these titles, because Solr interprets the slashes as child-node separators.
We cannot use HTML entitizing and de-entitizing because we would lose the search features on the names, unless we managed encoding and decoding of the special characters everywhere, which we have no time to do.
My question is simple:
Is there a way to configure a separator character for the hierarchical facets path? For example, in TypoScript, a neat simple configuration key:
plugin.tx_solr.index.fieldProcessingInstruction.separator = ### #<--Whatever...
I would be so glad not to have to dive into the Typo3 Solr extension source code again to bugfix my website!
Thanks to anybody for any clue.
OK, after having lost some time trying to configure this in schema.xml and in the general_schema_*.xml files, I went to the source code of the Typo3 Solr extension, my old dreaded sleeping balrog.
It appears that the separator character is hardcoded in 5 scattered class files:
class.tx_solr_facet_hierarchicalfacetrenderer.php
class.tx_solr_fieldprocessor_pathtohierarchy.php
class.tx_solr_facet_hierarchicalfacethelper.php
class.tx_solr_fieldprocessor_pageuidtohierarchy.php
class.tx_solr_query_filterencoder_hierarchy.php
All I did was replace it in these files (pointing them to one unique public static constant, duh) and apologize to my supervisors for taking so long to correct such a simple and stupid bug, and now everything works fine!
