Above is a screenshot of the Schema Browser screen for a particular index. The field is brandName.
The field type is defined as follows:
<fieldType name="wc_keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
Indexed, Tokenized, Stored, etc. are the properties of the field. Can anyone explain what the other rows, Schema and Index (marked with the red box), signify?
I think these rows describe where the properties of a field come from. Initially, when the index is empty, this screen contains only the Properties row, which led me to the intuition that the properties are taken from schema.xml.
The Index row appears only after I added some documents to the Solr index. For example, my id field isn't stored, so I do not have any information in this row for that field (pay attention to the (unstored field) text).
The Schema row is a bit trickier. I thought it had something to do with the Schema API, i.e. that when you create or update a field via REST calls, the Schema row would reflect that. However, it turns out to be different: if I modify the field type (for example, add docValues support to a field that didn't have it), you get this screen.
This leads me to the idea that the Schema row represents what is defined in the schema, while Properties holds the current values. Remember, I had just added docValues support. It also suggests that with ClassicIndexSchemaFactory the Schema and Properties rows should be the same, while with ManagedIndexSchemaFactory these rows could differ.
I created my own core on http://localhost:8983/solr and added some documents so I could query it. But when I query for something like "dog", I want documents that contain "pooch" to be returned too. So I want to implement the SVD algorithm to improve my results.
I am new to search engines. All I know is that I can use Mahout to implement SVD, but that seems a little difficult because I have to install Maven, Hadoop and Mahout.
Any suggestions will be appreciated.
You can use SynonymGraphFilterFactory
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a replacement for the Synonym Filter, which produces incorrect graphs for multi-token synonyms.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Synonym Filter.
Create a file, e.g. mysynonyms.txt, in the directory your_collection/conf/ and list the synonyms using the => sign:
pooch,pup,fido => dog
huge,ginormous,humungous => large
An example schema would be:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
Source : https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions
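To make the effect of the explicit mapping rules above (pooch,pup,fido => dog) concrete, here is a minimal plain-Java sketch of what such a rule does to a token stream. It is not Solr code: the class SynonymSketch and its methods are hypothetical names invented for this illustration, and the real filter also handles multi-token synonyms and token positions, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr: mimics explicit "a,b => c" synonym rules.
class SynonymSketch {

    // Parse lines such as "pooch,pup,fido => dog" into a lookup table.
    static Map<String, String> parseRules(List<String> lines) {
        Map<String, String> rules = new HashMap<>();
        for (String line : lines) {
            String[] sides = line.split("=>");
            if (sides.length != 2) {
                continue; // skip anything that is not an explicit mapping
            }
            String target = sides[1].trim();
            for (String source : sides[0].split(",")) {
                rules.put(source.trim().toLowerCase(), target);
            }
        }
        return rules;
    }

    // Replace every whitespace-separated token that has a rule; keep the rest as-is.
    static List<String> rewrite(String text, Map<String, String> rules) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            out.add(rules.getOrDefault(token.toLowerCase(), token));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rules = parseRules(List.of(
                "pooch,pup,fido => dog",
                "huge,ginormous,humungous => large"));
        System.out.println(rewrite("huge pooch", rules)); // prints [large, dog]
    }
}
```

Because the same rewrite is applied when indexing and when querying, a document containing "pooch" ends up indexed under "dog", so a query for "dog" finds it.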
There is another way to augment your index with terms that are not in the content. Synonyms are good, as @ashraful says, but there are two other problems you will run into:
words used but not in the synonym list
behavioral search: using other user behavior as a hint to what they are looking for
These require you to augment the index with terms learned from 1) other searches, and 2) user behavior. Mahout's Correlated Cross Occurrence algorithm can help with both. You can set it up to find terms that lead to people reading an item and (if you have something like purchase or other preference data) conversion items that correlate with items in the index. In the second case you would add user conversions to the search query to personalize the results.
A blog about the technique here: http://actionml.com/blog/personalized_search
The page on Mahout docs here: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
You should also look at word2vec, which will (given the right training data) find that "dog" and "pooch" are synonyms regardless of the synonym list because it is learned from the data. I'm not sure how you add word2vec to Solr but it is integrated into Fusion, the closed source product of Lucid.
I am new to Solr. I have data in Solr like "name":"John Lewis".
The following filter query matches it perfectly: fq=name%3A+%22John+Lewis%22
It was built in the Solr console and works well.
My requirement is to search for a single word coming from my Java layer, such as "JohnLewis". It has to match "John Lewis" in the Solr index.
This search is not restricted to the name field (two words with a space in between).
I have other values such as "Cash Reward Credit Cards", which has four words and which the user would query as "CashRewardCreditCards".
Could someone tell me whether this can be handled in schema.xml with any of the parsers available in Solr?
You need to create a custom fieldType.
First, define a fieldType in your Solr schema:
<fieldType name="word_concate" class="solr.TextField" indexed="true" stored="false">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*" replacement=""/>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
Here we named the fieldType word_concate.
We used the char filter solr.PatternReplaceCharFilterFactory.
A char filter is a component that pre-processes input characters. Char filters can be chained like token filters and are placed in front of a tokenizer. PatternReplaceCharFilterFactory uses regular expressions to replace or change character patterns.
Pattern: \s* means zero or more whitespace characters.
Second, create a field with word_concate as its type:
<field name="cfname" type="word_concate"/>
Copy your name field to cfname with a copyField:
<copyField source="name" dest="cfname"/>
Third, reindex the data.
Now you can query cfname:"JohnLewis" and it will return the document whose name is John Lewis.
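As a rough illustration of what the char filter does to the field value before tokenization, the following plain-Java sketch applies the same regular expression (\s* replaced by the empty string) with java.util.regex. The class name ConcatSketch is hypothetical and this is not Solr code; it only shows the string transformation.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: applies the same pattern as the char filter above.
class ConcatSketch {

    // Same pattern as in the fieldType: zero or more whitespace characters.
    private static final Pattern WHITESPACE = Pattern.compile("\\s*");

    // Remove all whitespace so that a multi-word value becomes one token.
    static String concatenate(String value) {
        return WHITESPACE.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(concatenate("John Lewis"));               // prints JohnLewis
        System.out.println(concatenate("Cash Reward Credit Cards")); // prints CashRewardCreditCards
    }
}
```

Note that the fieldType above contains no lowercasing step, so the comparison stays case-sensitive: "johnlewis" is a different token from "JohnLewis".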
Assuming your input is CamelCase as shown, I would use Solr's Word Delimiter Filter with the splitOnCaseChange parameter on the query side of your analyzer as a starting point. This will take an input token such as CashRewardCreditCards and generate the tokens Cash Reward Credit Cards.
See also:
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
Look at WordDelimiterFilterFactory
It has a splitOnCaseChange property. If you set that to 1, JohnLewis will be indexed as John Lewis.
You'll need to add this to your query analyzer. If the user searches for JohnLewis, the search will be translated to John Lewis.
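The case-change split described above can be approximated in plain Java to see which tokens come out. This is only a sketch of the idea, assuming that a split happens wherever a lowercase letter is directly followed by an uppercase one; it is not the Lucene implementation, and the class name CaseChangeSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr: approximates splitting on case changes.
class CaseChangeSketch {

    // Split at every position where a lowercase letter is followed by an uppercase letter.
    static List<String> splitOnCaseChange(String token) {
        return Arrays.asList(token.split("(?<=\\p{Ll})(?=\\p{Lu})"));
    }

    public static void main(String[] args) {
        System.out.println(splitOnCaseChange("JohnLewis"));             // prints [John, Lewis]
        System.out.println(splitOnCaseChange("CashRewardCreditCards")); // prints [Cash, Reward, Credit, Cards]
    }
}
```

The resulting tokens still have their original case, so whether they match the indexed "John Lewis" depends on the rest of the analyzer chain being the same on both sides.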
I've got a Solr (version 4.10.3) cloud consisting of 3 Solr instances managed by ZooKeeper. Each core is replicated from the current leader to the other 2 for redundancy.
Now to the problem. I need to index a datetime field from SQL as a TextField for wildcard queries (not the best solution, but a requirement nonetheless). On the core that does the import, everything looks like it should and the field contains values like 2008.10.18 17:16:31.0, but the corresponding document (synced by the replication handler) on the other cores has values like Sat Oct 18 17:16:31 CEST 2008 for the same field. I've been trying for a while to get to the bottom of this without success. The behavior of both the core and the cloud is as intended aside from this.
Does anyone have an idea of what I'm doing wrong?
The fieldType looks like this:
<fieldType name="stringD" class="solr.TextField" sortMissingLast="true" omitNorms="false">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([-])" replacement="." replace="all" />
</analyzer>
</fieldType>
Here is a link to a screenshot showing the behavior in all its glory, the top part is from the core that did the full-import.
So my first answer goes to my first question here ;)
When this core was initially set up, an import query like this was used:
SELECT * FROM [TABLE]
and the fields were then mapped like this in the data-import-handler:
<field column="ENDTIME" name="ENDTIME" />
When Solr started to convert the content of the [ENDTIME] (datetime2) column in SQL to a date, this was added to the import query:
CAST(CAST(ENDTIME as datetime2(0)) as varchar(100)) as ENDTIMESTR
to force the correct format from SQL: 2008-10-18 17:16:31.0.
The data-import-handler mapping was also changed to the following:
<field column="ENDTIMESTR" name="ENDTIME" />
Because of this, both [ENDTIME] and [ENDTIMESTR] came from SQL into the data-import-handler and somehow Solr was only able to use the correct field/fieldType on the core which initiated the full-import. When replicating the field to the other cores Solr seems to have looked at the original [ENDTIME] column (only existing in the data-import-handler during a full/delta-import, remember SELECT * FROM [TABLE]). ENDTIME in the Solr-schema was a TextField all along.
SOLUTION: Remove the * and instead explicitly list all fields in the full/delta queries, with [ENDTIME] looking like this: CAST(CAST(ENDTIME as datetime2(0)) as varchar(100)) as ENDTIME.
Everything now behaves as intended. I guess there's a bug in the data-import-handler mapping somewhere but my configuration wasn't really the best either.
Hope this can help someone else out on a slippery-Solr-slope!
I have a Solr index filled with documents, with a field named issuer.
There is a document with issuer=first issuer.
I'm trying to implement matching of two consecutive words. The first word needs to match completely, the second needs to match partially.
What I am trying to achieve is:
I search for something like: issuer:first\ iss*
I expect it to match "first issuer"
I tried the following solutions but none is working:
issuer:first\ iss* -> returns nothing
issuer:"first iss"* -> returns everything
issuer:(first iss*) -> also returns "issuer first"
Does anybody have a clue on how to achieve the desired result?
My suggestion is to add a shingle-filter-based field type to your schema. Below is a simple definition:
<fieldtype name="shingle" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
</analyzer>
</fieldtype>
You then add another field with this type as shown below:
<field name="issuer_sh" type="shingle" indexed="true" stored="false"/>
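At query time, you can issue a prefix query against the shingle tokens. Escape the space rather than quoting the phrase, because the standard query parser does not expand wildcards inside quotes:
issuer_sh:first\ iss*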
The ShingleFilter creates word n-gram tokens (shingles) from your text. For instance, if the issuer field contains "first issue", then Solr will create and index the following tokens:
first
issue
first issue
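To see which tokens a longer value would produce, here is a small plain-Java sketch of shingle generation. It assumes whitespace tokenization and that single words are emitted alongside the shingles, as in the token list above; it is not the Lucene ShingleFilter, and the class name ShingleSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr: builds word shingles from whitespace-separated text.
class ShingleSketch {

    // For each start word, emit the word itself plus every shingle of minSize..maxSize words.
    static List<String> shingles(String text, int minSize, int maxSize) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            out.add(words[start]); // single word (unigram)
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("first issue", 2, 5)); // prints [first, first issue, issue]
        // A prefix such as "first iss" can match a whole shingle token:
        System.out.println(shingles("first issuer", 2, 5).stream()
                .anyMatch(token -> token.startsWith("first iss"))); // prints true
    }
}
```

Because each shingle is a single indexed term containing the space, a prefix match on "first iss" only succeeds when "first" is directly followed by a word starting with "iss".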
You can't search with wildcards in phrase queries. Without changing how you are indexing (see @ameertawfik's answer), the standard query parser doesn't provide a good way to do this. You can, however, use the surround query parser to search using spans. This query would then look like:
1N(first, iss*)
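Keep in mind that the surround query parser does not analyze its input, so 1N(first, iss*) and 1N(First, iss*) will not find the same results. N matches the terms in either order; use W instead, as in 1W(first, iss*), if "first" has to come before the prefix match.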
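You could also construct this query using Lucene's SpanQuery classes directly, of course:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.spans.*; // org.apache.lucene.queries.spans.* in Lucene 9+
// "first" immediately followed (slop 0, in order) by any term starting with "iss"
SpanQuery[] queries = new SpanQuery[2];
queries[0] = new SpanTermQuery(new Term("issuer", "first"));
queries[1] = new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("issuer", "iss")));
SpanQuery finalQuery = new SpanNearQuery(queries, 0, true);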
I'm learning Solr and have become confused trying to figure out ICUCollation: what it does, what it is for and how to use it. I haven't found any good explanation of this online. The docs appear to say that I need to use ICUCollation and imply that it does magical things for me, but they do not seem to explain exactly why or exactly what, or how it integrates with anything else.
Say I have a text field in French and I want stopwords removed, accents, punctuation and case ignored and stemming... how does ICUCollation come into this? Do I set solr.ICUCollationField and locale='fr' and it will do everything else automatically? Or do I set solr.ICUCollationField and then tokenizer and filters on this in addition? Or do I not use solr.ICUCollationField at all because that's for something completely different? And if so, then what?
Collation is the organisation of written information into an order. ICUCollationField (the API documentation also provides a good description) is meant to give you locale-aware sorting, since sort order is defined by cultural norms and language-specific rules. This allows different sorting based on those rules, such as the difference between Norwegian and Swedish: a Swede would order Å before Æ/Ä and Ø/Ö, while a Norwegian would order them Æ/Ä, Ø/Ö and then Å.
Since you usually don't want to sort by a tokenized field (exception: KeywordTokenizer) or a multivalued field, these fields are usually not processed any more than allowing for the sorting / collation to be performed.
There is a case to be made for collation filters for searching as well, as search in practice is just comparison. This means that if you're aiming to search for two words that would be identical when compared in the locale provided, it would be a hit. The tokens indexed will not make any sense when inspected, but as long as the values are reduced to the same token both when indexing and searching, it would work. There's an example of this on the wiki under UnicodeCollation.
Collation does not affect stopwords (StopFilterFactory), accents (ICUFoldingFilterFactory), punctuation, case (LowerCaseFilterFactory or ICUNormalizer2FilterFactory; for sorting this depends on the locale: if the locale is case-aware, case is not ignored) or stemming (SnowballPorterFilterFactory). Have a look at the suggested filters for those. Most filters and tokenizers in Solr do one very specific task and avoid doing "everything and the kitchen sink" in a single filter.
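To get a feel for the locale-aware comparison described above, here is a minimal sketch using the JDK's own java.text.Collator. This is a stand-in chosen so the example runs without extra libraries: Solr's ICUCollationField is built on ICU, whose rules and sort keys may differ in detail, and the class name CollationSketch is hypothetical. Non-ASCII letters are written as Unicode escapes only to keep the source encoding-independent.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Solr: uses the JDK collator as a stand-in for ICU.
class CollationSketch {

    // PRIMARY strength compares base letters only, ignoring accent and case differences.
    static Collator primaryCollator(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    // Return a copy of the values sorted with the locale's primary-strength collator.
    static List<String> sorted(List<String> values, Locale locale) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(primaryCollator(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "zebre" with a grave accent, "Arbre", "ecole" with an acute accent
        List<String> words = List.of("z\u00E8bre", "Arbre", "\u00E9cole");
        System.out.println(sorted(words, Locale.FRENCH));

        // "Ete" with two accented letters versus plain lowercase "ete"
        Collator french = primaryCollator(Locale.FRENCH);
        System.out.println(french.compare("\u00C9t\u00E9", "ete") == 0);
    }
}
```

A plain String sort would place the accented "é" word after the "z" word because it compares code points; the collator orders by base letter instead, which is the kind of behavior a collated sort field is meant to provide.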
You normally have two or more fields for one text input if you want to do different things like:
search: text analysis
sort: language sensitive / case insensitive sorting
facet: string
For search use something like:
<fieldType name="textFR" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ElisionFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
For sorting use:
<fieldType name="textSortFR" class="solr.ICUCollationField"
locale="fr"
strength="primary" />
or simply:
<fieldType name="textSort" class="solr.ICUCollationField"
locale=""
strength="primary" />
(If you have to support many languages. Should work fine enough in most cases.)
Do make use of the Analysis UI in the Solr Admin: open the analysis view for your index, select the field type (e.g. your sort field), enter a representative input value in the left text area and a test value in the right one (for sorting, the right-hand value is less interesting, as the sort field is not used for matching).
The output will show you whether:
accents are removed
elisions are removed
lower casing is applied
etc.
For example, if you see that elisions are not removed (l'atelier is not reduced to atelier) but you would like to discard them for sorting, you would have to add the elision filter (see the example search field type above).
https://cwiki.apache.org/confluence/display/solr/Language+Analysis