Solr reinterprets field during replication

I've got a Solr (version 4.10.3) cloud consisting of 3 Solr instances managed by ZooKeeper. Each core is replicated from the current leader to the other 2 for redundancy.
Now to the problem. I need to index a datetime field from SQL as a TextField for wildcard queries (not the best solution, but a requirement nonetheless). On the core that does the import, everything looks like it should and the field contains values like: 2008.10.18 17:16:31.0 but the corresponding document (synced by the ReplicationHandler) on the other cores has values like: Sat Oct 18 17:16:31 CEST 2008 for the same field. I've been trying for a while to get to the bottom of this without success. Aside from this, both the core and the cloud behave as intended.
Does anyone have an idea of what I'm doing wrong?
The fieldType looks like this:
<fieldType name="stringD" class="solr.TextField" sortMissingLast="true" omitNorms="false">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([-])" replacement="." replace="all" />
</analyzer>
</fieldType>
Here is a link to a screenshot showing the behavior in all its glory, the top part is from the core that did the full-import.

So my first answer goes to my first question here ;)
When this core was initially set up, an import query like this was used:
SELECT * FROM [TABLE]
and then the fields were mapped like this in the data-import-handler.
<field column="ENDTIME" name="ENDTIME" />
When Solr started to convert the content of the [ENDTIME] (datetime2) column in SQL to a date, this was added to the import-query.
CAST(CAST(ENDTIME as datetime2(0)) as varchar(100)) as ENDTIMESTR
to force the correct format from SQL: 2008-10-18 17:16:31.0.
The data-import-handler mapping was also changed to the following:
<field column="ENDTIMESTR" name="ENDTIME" />
Because of this, both [ENDTIME] and [ENDTIMESTR] came from SQL into the data-import-handler, and somehow Solr was only able to use the correct field/fieldType on the core that initiated the full-import. When replicating the field to the other cores, Solr seems to have looked at the original [ENDTIME] column (which only exists in the data-import-handler during a full/delta-import; remember SELECT * FROM [TABLE]). ENDTIME in the Solr schema was a TextField all along.
SOLUTION: Remove the * and instead explicitly list all fields in the full/delta queries, with [ENDTIME] looking like this: CAST(CAST(ENDTIME as datetime2(0)) as varchar(100)) as ENDTIME.
Everything now behaves as intended. I guess there's a bug in the data-import-handler mapping somewhere, but my configuration wasn't really the best either.
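To make the two renderings from the question concrete, here is a minimal Python sketch (not Solr code). It assumes the analyzer's PatternReplaceFilter simply swaps every "-" for ".", and it assumes the replicated value looks like what Java's Date.toString() would print. The "CEST" zone label is copied from the question rather than computed.

```python
import re
from datetime import datetime

def pattern_replace(value: str) -> str:
    # Mimics the filter from the question:
    # pattern="([-])", replacement=".", replace="all".
    return re.sub(r"([-])", ".", value)

# String produced by the CAST(...) expression in the import query.
sql_string = "2008-10-18 17:16:31.0"
indexed_term = pattern_replace(sql_string)

# The same instant laid out like java.util.Date.toString().
# "CEST" is hard-coded from the question, not derived from a time zone.
ts = datetime(2008, 10, 18, 17, 16, 31)
date_like = f"{ts:%a %b %d %H:%M:%S} CEST {ts:%Y}"

print(indexed_term)  # value seen on the importing core
print(date_like)     # value seen on the replicated cores
```

If the value arrives as a string, the analyzer yields the dotted form. If it arrives as a date object first, you end up with the second form, which is why casting to varchar under the mapped column name matters.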
Hope this can help someone else out on a slippery-Solr-slope!

solr migration pdate vs. tdate

I'm migrating my Solr environment from 6.3 to 7.2 and walking through all the config files.
In 6.3 I have a lot of date fields using the tdate fieldType, which uses solr.TrieDateField.
<fieldType name="tdate" class="solr.TrieDateField" positionIncrementGap="0" docValues="true" precisionStep="6"/>
In Solr 7 the tdate field type is no longer part of the default schema file. Instead of tdate, Solr 7 seems to use pdate:
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
Looking at this "Solr 7 fieldTypes doc", it seems like tdate is no longer available in Solr 7.x.
Can and should I change all the fields using tdate to pdate?
First, you can keep using TrieDateField if you don't want to change anything. It is deprecated, but not removed. If this declaration
<fieldType name="tdate" class="solr.TrieDateField" positionIncrementGap="0" docValues="true" precisionStep="6"/>
is missing in your schema, add it.
But can you change to pdate? Sure: if reindexing is easy for you, you can change the type and reindex. Should you? The newer type is more efficient, but for some use cases the new types were less performant than the older ones. If you have a good testbed that reflects your real-world usage, the best thing would be to benchmark both; if the newer ones perform at least as well as the older ones, I would say upgrade.
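If you do decide to switch, the schema edit itself is mechanical. Below is a rough Python sketch that rewrites type="tdate" to type="pdate" on field elements. The schema fragment and field names are invented for illustration, dynamicField entries are not handled, and you would still need the pdate fieldType declared and a full reindex afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; the field names are made up.
SCHEMA = """<schema name="example" version="1.6">
  <field name="created" type="tdate" indexed="true" stored="true"/>
  <field name="title" type="string" indexed="true" stored="true"/>
  <field name="modified" type="tdate" indexed="true" stored="true"/>
</schema>"""

def switch_field_type(schema_xml, old_type, new_type):
    """Return (new_xml, changed_names) with old_type replaced by new_type."""
    root = ET.fromstring(schema_xml)
    changed = []
    for field in root.iter("field"):
        if field.get("type") == old_type:
            field.set("type", new_type)
            changed.append(field.get("name"))
    return ET.tostring(root, encoding="unicode"), changed

new_schema, changed = switch_field_type(SCHEMA, "tdate", "pdate")
print(changed)  # names of the fields that were switched
```

Running the same queries against an index built from each schema version is then what the benchmark above would compare.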

Field in schema-browser screen in Solr admin Console

Above is the screenshot attached for the schema browser screen for a particular index. The field is brandName.
The field type is defined as follows:
<fieldType name="wc_keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
Indexed, Tokenized, Stored, etc. are the properties of the field. Can anyone explain what the other rows, such as Schema and Index (marked with the red box), signify?
I think this describes where the properties of a field come from. Initially, when you have an empty index, this screen contains only the Properties row, which led me to the intuition that the properties are taken from schema.xml.
The Index row appears only after I added some documents to the Solr index. For example, my id field isn't stored, so I do not have any information in this row for that field (note the (unstored field) text).
The Schema row is a bit trickier to me. I was thinking that it has something to do with the Schema API, so that when you create or update a field via REST calls, this Schema row would reflect that. However, it turns out differently: if I modify the field type (for example, add docValues support to a field that didn't have it), you get this screen.
This leads me to the idea that the Schema row actually represents what is happening in the schema, while the Properties row has the current one. Remember, I've added support for docValues. This in turn leads me to the idea that with ClassicIndexSchemaFactory the Schema and Properties rows should be the same, while with ManagedIndexSchemaFactory these rows could differ.

Solr - search word immediately followed by partial match (with wildcard)

I have a Solr index filled with documents, with a field named issuer.
There is a document with issuer=first issuer.
I'm trying to implement matching of two consecutive words. The first word needs to match completely, the second needs to match partially.
What I am trying to achieve is:
I search for something like: issuer:first\ iss*
I expect it to match "first issuer"
I tried the following solutions but none is working:
issuer:first\ iss* -> returns nothing
issuer:"first iss"* -> returns everything
issuer:(first iss*) -> also returns "issuer first"
Does anybody have a clue on how to achieve the desired result?
My suggestion is to add a shingle-filter-based field type to your schema. Below is a simple definition:
<fieldtype name="shingle" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
</analyzer>
</fieldtype>
You then add another field with this type as shown below:
<field name="issuer_sh" type="shingle" indexed="true" stored="false"/>
At query time, you can issue the following query:
issuer_sh:"first iss*"
The ShingleFilter creates word n-gram tokens (shingles) from your text. For instance, if the issuer field contains "first issue", then Solr will create and index the following tokens:
first
issue
first issue
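As a rough illustration of what ends up in the index, here is a small Python sketch of shingling over whitespace tokens. It is a toy model rather than the actual ShingleFilter, and it assumes single tokens are emitted alongside the shingles. The helper names are made up.

```python
def shingles(text, min_size=2, max_size=5, output_unigrams=True):
    """Return unigrams plus word n-grams of min_size..max_size words."""
    tokens = text.split()  # stands in for the whitespace tokenizer
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out

def has_prefix_term(text, prefix):
    # A prefix match against single indexed terms, shingles included.
    return any(term.startswith(prefix) for term in shingles(text))

print(shingles("first issue"))
print(has_prefix_term("first issuer", "first iss"))  # adjacent and in order
print(has_prefix_term("issuer first", "first iss"))  # wrong order
```

Because "first issuer" is stored as one term, a prefix such as "first iss" can match it, while "issuer first" produces no term starting with that prefix.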
You can't search with wildcards in phrase queries. Without changing how you are indexing (see @ameertawfik's answer), the standard query parser doesn't provide a good way to do this. You can, however, use the surround query parser to search using spans. This query would then look like:
1N(first, iss*)
Keep in mind, surround query parser does not analyze, so 1N(first, iss*) and 1N(First, iss*) will not find the same results.
You could also construct this query using Lucene's SpanQueries directly, of course, like:
// Exact term "first", immediately followed (slop 0, in order) by a term starting with "iss".
SpanQuery[] queries = new SpanQuery[2];
queries[0] = new SpanTermQuery(new Term("issuer", "first"));
queries[1] = new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("issuer", "iss")));
Query finalQuery = new SpanNearQuery(queries, 0, true);
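To show what the span query above is meant to accept and reject, here is a tiny Python model of "exact term immediately followed by a prefix term, in order". It only mimics the matching rule on whitespace tokens; it is not how Lucene evaluates spans, and the function name is hypothetical.

```python
def ordered_adjacent_match(text, exact, prefix):
    """True if `exact` is directly followed by a token starting with `prefix`."""
    tokens = text.split()
    return any(
        first == exact and second.startswith(prefix)
        for first, second in zip(tokens, tokens[1:])
    )

print(ordered_adjacent_match("first issuer", "first", "iss"))      # adjacent, in order
print(ordered_adjacent_match("issuer first", "first", "iss"))      # reversed order
print(ordered_adjacent_match("first big issuer", "first", "iss"))  # a word in between
```

Like the surround parser, this sketch does no analysis, so "First issuer" would not match a search for "first".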

Show all occurrences of query while highlighting in solr 1.4

I have a Solr 1.4 setup with a text field containing ebook data. The parameters sent to Solr are:
"hl.fragsize":"0",
"indent":"1",
"hl.simple.pre":"{{{",
"hl.fl":"body_eng",
"hl.maxAnalyzedChars":"-1",
"wt":"json",
"hl":"true",
"rows":"1",
"fl":"ia,body_length,page_count",
"q":"ia:talesofpunjabtol00stee AND PUNJAB",
"q.op":"AND",
"f.body_eng.hl.snippets":"428",
"hl.simple.post":"}}}",
"hl.usePhraseHighlighter":"true"}},
However, the results show only 20 highlighted occurrences of the word PUNJAB.
I tried "f.body_eng.hl.snippets":"428", but even this isn't working.
body_eng is a big text field. The highlighting only works up to a certain length. I have tried other words as well. In all the examples, highlighting works up to around 54K characters.
What could be the reason?
First of all: 1.4 is a very old version of Solr. I'm not sure if per-field values were supported at that time (highlighting itself was introduced with Solr 1.3). The default highlighter was changed in 3.1.
You should, however, be able to highlight all occurrences in a field by supplying a large value for hl.maxAnalyzedChars (not sure if -1 will do what you want). Another option to try is a large hl.maxAnalyzedChars value together with a large hl.fragsize value (use the same value for both parameters, and not 0).
If you're still unable to get it to work, test it on a more recent version of Solr to see if it's an issue that has already been fixed.
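For reference, a request along those lines could be assembled like this. It is only a sketch: the host, port and path are hypothetical, and the value 10000000 is an arbitrary "large" number, not a recommended setting.

```python
from urllib.parse import urlencode

LARGE = 10_000_000  # illustrative value only

params = {
    "q": "ia:talesofpunjabtol00stee AND PUNJAB",
    "hl": "true",
    "hl.fl": "body_eng",
    "hl.maxAnalyzedChars": LARGE,  # analyze (nearly) the whole field
    "hl.fragsize": LARGE,          # same value as above, and not 0
    "wt": "json",
}

query_string = urlencode(params)
# Hypothetical endpoint; adjust to your own installation.
url = "http://localhost:8983/solr/select?" + query_string
print(url)
```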
So, after a lot of asking around, it's working now.
The query params were correct. The schema was causing the problem. The change made was:
<filter class="solr.SnowballPorterFilterFactory" language="English" />
was replaced with
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

How to use SynonymFilterFactory in Solr?

I'm trying to execute synonym filtering at query time so that if I search for X, results for Y also show up.
I go to where Solr is being run, edit the .txt file and add X, Y on a new line.
This does not work. I check the schema and I see:
<analyzer type="query">
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
What am I missing?
EDIT
Assessing configuration files
tomcat6/Catalina/localhost seems to point to the correct location
<Context docBase="/data/solr/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
<Environment name="solr/home" type="java.lang.String" value="/data/solr" override="true" />
</Context>
Also, in the Solr admin I see this. What does cwd mean?
cwd=/usr/share/tomcat6 SolrHome=/data/solr/
Use the SynonymFilterFactory only at index time, not query time. There are some subtle but well-understood problems with synonyms at query time.
See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
After you move synonyms to the index analyzer chain, check that they are working with the Analysis page in the admin UI.
The answer from @Walter Underwood is good, but incomplete.
Whether you use the SynonymFilterFactory at index or query time depends on your default operator.
So, let's say we have a synonym file with this entry:
5,five
If your default operator is OR (which is the default default operator), then you should set up your synonyms on the query filter. This way a query for "5" will be passed to the backend as a query for "5" OR "five", say, and the backend would respond appropriately. At the same time, you can make changes to your synonym file without reindexing, and your index is smaller since it doesn't have to have so many tokens.
However, if you change the default operator to AND, you should set up your synonyms on the index filter instead. If you don't, a query for "5" would go to the backend as "5" AND "five", and it would not match the documents that it's expected to. This, alas, makes the index bigger, and also means new synonyms require complete reindexes.
Note: The documentation for this is currently wrong, leaving out all these details.
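The trade-off described above can be played through with a toy model. This Python sketch is not how Solr builds queries; it just assumes, as the explanation does, that a query-time synonym expands "5" into both "5" and "five" and that the default operator decides whether any or all of them must match. All names are made up.

```python
SYNONYMS = {"5": {"5", "five"}, "five": {"5", "five"}}

def expand(term):
    # Synonym expansion with expand="true": every variant is kept.
    return SYNONYMS.get(term, {term})

def query_time_match(doc_tokens, query_term, operator):
    wanted = expand(query_term)
    present = wanted & set(doc_tokens)
    if operator == "OR":
        return bool(present)      # any variant is enough
    return present == wanted      # AND: every variant must be in the doc

def index_time_match(doc_tokens, query_term):
    # Synonyms applied while indexing; the query term is left as typed.
    indexed = set()
    for token in doc_tokens:
        indexed |= expand(token)
    return query_term in indexed

doc = "take five".split()
print(query_time_match(doc, "5", "OR"))   # matches via "five"
print(query_time_match(doc, "5", "AND"))  # fails: "5" itself is not in the doc
print(index_time_match(doc, "5"))         # matches: "5" was added at index time
```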
