Get Solr field value from text - solr

I am running a script that takes paragraphs from a text and converts them to individual text files. At the top of the file I write the following properties:
title: The Constitution of Athens
author: Aristotle
book: 1
section: 2
paragraph: 5
text goes here....
I would like to create a Solr schema that mirrors the fields I describe and indexes them accordingly for each document. I know I can define a field type in the schema with an analyzer as follows:
<fieldType name="title" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.core.<analyzer>"/>
</fieldType>
But I don't know which analyzer to use so that it detects each specific value. Thank you.
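One possible approach, sketched here rather than taken from the post, is to split the header in the script itself and send each value to Solr as its own field, since an analyzer only controls how the value of a single field is tokenized. The sketch assumes every header line has the form `key: value`; the function name `parse_document`, the `HEADER_KEYS` set, and the `text` field name are hypothetical.

```python
# Minimal sketch: split a "key: value" header from the body text.
# Assumptions (not from the original post): every header line contains a colon,
# the header is the leading run of lines whose key is in HEADER_KEYS,
# and everything after it goes into a hypothetical "text" field.
HEADER_KEYS = {"title", "author", "book", "section", "paragraph"}


def parse_document(raw):
    fields = {}
    lines = raw.splitlines()
    body_start = 0
    for i, line in enumerate(lines):
        key, sep, value = line.partition(":")
        # Stop at the first line that is not a known "key: value" header.
        if not sep or key.strip().lower() not in HEADER_KEYS:
            break
        fields[key.strip().lower()] = value.strip()
        body_start = i + 1
    fields["text"] = "\n".join(lines[body_start:]).strip()
    return fields


sample = (
    "title: The Constitution of Athens\n"
    "author: Aristotle\n"
    "book: 1\n"
    "section: 2\n"
    "paragraph: 5\n"
    "text goes here....\n"
)
doc = parse_document(sample)
```

Each key in the resulting dictionary would then correspond to a field declared in the schema, and the dictionary can be sent to Solr as one document.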

Related

Field in schema-browser screen in Solr admin Console

Above is a screenshot of the Schema Browser screen for a particular index. The field is brandName.
The field type is defined as follows:
<fieldType name="wc_keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
Indexed, Tokenized, Stored, etc. are the properties of the field. Can anyone explain what the other rows, Schema and Index (marked with the red box), signify?
I think this describes where the properties of a field come from. Initially, when the index is empty, this screen contains only the Properties row, which led me to the intuition that the properties are taken from schema.xml.
The Index row appears only after I added some documents to the Solr index. For example, my id field isn't stored, so I have no information in this row for that field (note the "(unstored field)" text).
The Schema row is a bit trickier. I first thought it had something to do with the Schema API, so that when you create or update a field via REST calls, the Schema row would reflect that. However, it turned out differently: if I modify the field type (for example, add docValues support to a field that didn't have it), you get this screen.
This leads me to think that the Schema row represents what is happening in the schema, while the Properties row holds the current values. Remember, I added docValues support. It also suggests that with ClassicIndexSchemaFactory the Schema and Properties rows should be the same, while with ManagedIndexSchemaFactory they could differ.

Using Solr to search for lname 'smith', I am getting 'smitty' and 'smits'

I have defined this field in the schema as follows:
<field name="lname" type="string" indexed="true" stored="true"/>
According to the docs in the schema:
The StrField type is not analyzed, but indexed/stored verbatim.
It supports doc values but in that case the field needs to be
single-valued and either required or have a default value.
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
Yet when I query lname:smith, I get smitty and smits, at least in the top 10 results, but not smith. Why is smith not being returned?
However, when I try lname:smith* I get many results.
Any pointer to get the correct result is appreciated.
I am using Solr 6.3
Did you reindex the data after the change in the schema? Are you sure that you didn't have that field previously defined as solr.TextField? I ask because solr.StrField stores the entire value as a single term in the index.
Use the analysis section on the Admin UI to check which terms are being generated on your lname field, and check that you've reindexed the data after the schema change. Do you have some other special component defined in the update handler that you're using?

Solr query data with white space needs to be queried

I am new to Solr. I have data in Solr like "name":"John Lewis".
The following query is well formed and searches perfectly: fq=name%3A+%22John+Lewis%22
It was built in the Solr console and works well.
My requirement is to search for a single word coming from my Java layer, such as "JohnLewis". It has to match "John Lewis" in Solr.
This search is not restricted to the name field (2 words with a space in between).
I have other values like "Cash Reward Credit Cards", which has 4 words, and the user would query it as "CashRewardCreditCards".
Could someone tell me whether this can be handled in schema.xml with any of the parsers available in Solr?
You need to create a custom fieldType.
First, define a fieldType in your Solr schema:
<fieldType name="word_concate" class="solr.TextField" indexed="true" stored="false">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*" replacement=""/>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
Here we named the fieldType word_concate.
It uses a char filter, solr.PatternReplaceCharFilterFactory.
A char filter is a component that pre-processes input characters. Char filters can be chained like token filters and are placed in front of a tokenizer. PatternReplaceCharFilterFactory uses regular expressions to replace or change character patterns.
The pattern \s* means zero or more whitespace characters.
Second, create a field with word_concate as its type:
<field name="cfname" type="word_concate"/>
Copy your name field to cfname with a copyField:
<copyField source="name" dest="cfname"/>
Third, reindex the data.
Now you can query cfname:"JohnLewis" and it will return the document with name John Lewis.
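To see what the char filter does to a value before it reaches the tokenizer, here is a rough Python equivalent of the same regular expression. It is only an illustration of the replacement step, not of Solr's analysis chain; the function name `concat_key` is hypothetical.

```python
import re


def concat_key(value):
    # Same pattern and empty replacement as the charFilter in word_concate:
    # every run of whitespace is removed.
    return re.sub(r"\s*", "", value)


# Map the whitespace-free form back to the original value, the way the
# cfname field lets a query for "JohnLewis" find "John Lewis".
names = ["John Lewis", "Cash Reward Credit Cards"]
indexed = {concat_key(name): name for name in names}
```

Note that the comparison stays case-sensitive, because the analyzer above has no lowercase filter: a query for "johnlewis" would not match.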
Assuming your input is CamelCase as shown, I would use Solr's Word Delimiter Filter with the splitOnCaseChange parameter on the query side of your analyzer as a starting point. This will take an input token such as CashRewardCreditCards and generate the tokens Cash Reward Credit Cards.
See also:
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
Look at WordDelimiterFilterFactory
It has a splitOnCaseChange property. If you set that to 1, JohnLewis will be indexed as John Lewis.
You'll need to add this to your query analyzer. If the user searches for JohnLewis, the search will be translated to John Lewis.
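The effect of splitting on case changes can be approximated in a few lines of Python. This is a simplified sketch that only breaks where a lowercase letter is followed by an uppercase one; the real filter has further options and rules that are not modelled here, and the function name `split_on_case_change` is hypothetical.

```python
import re


def split_on_case_change(token):
    # Insert a space at every lowercase-to-uppercase boundary, then split.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


parts = split_on_case_change("CashRewardCreditCards")
```

With this split applied on the query side, the user's single word becomes the same sequence of words that was indexed from the original value.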

Solr - search word immediately followed by partial match (with wildcard)

I have a Solr index filled with documents, with a field named issuer.
There is a document with issuer=first issuer.
I'm trying to implement matching of two consecutive words. The first word needs to match completely, the second needs to match partially.
What I am trying to achieve is:
I search for something like: issuer:first\ iss*
I expect it to match "first issuer"
I tried the following solutions, but none of them works:
issuer:first\ iss* -> returns nothing
issuer:"first iss"* -> returns everything
issuer:(first iss*) -> also returns "issuer first"
Does anybody have a clue on how to achieve the desired result?
My suggestion is to add a shingle-filter-based field type to your schema. Below is a simple definition:
<fieldType name="shingle" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
</analyzer>
</fieldType>
You then add another field with this type as shown below:
<field name="issuer_sh" type="shingle" indexed="true" stored="false"/>
At query time, you can issue the following query:
issuer_sh:"first iss*"
The ShingleFilter creates word n-gram tokens (shingles) from your text. For instance, if the issuer field contains "first issue", then Solr will create and index the following tokens:
first
issue
first issue
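The token list above can be reproduced with a small Python sketch, which also shows why a shingle term helps here: a prefix such as "first iss" can match a single indexed term that spans two words. The sketch assumes whitespace tokenization, single words kept alongside the shingles (as in the list above), and a single space as the separator; the function names `shingles` and `prefix_matches` are hypothetical.

```python
def shingles(text, min_size=2, max_size=5):
    # Whitespace tokenization, mirroring the WhitespaceTokenizerFactory above.
    words = text.split()
    terms = set(words)  # assumption: single words are kept as terms too
    for size in range(min_size, max_size + 1):
        for start in range(len(words) - size + 1):
            terms.add(" ".join(words[start:start + size]))
    return terms


def prefix_matches(terms, prefix):
    # Terms that a prefix lookup on `prefix` would hit.
    return sorted(t for t in terms if t.startswith(prefix))


terms = shingles("first issue")
hits = prefix_matches(shingles("first issuer"), "first iss")
misses = prefix_matches(shingles("issuer first"), "first iss")
```

Because word order is part of the shingle term, "issuer first" produces no term starting with "first iss", which is the behaviour the question asks for.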
You can't search with wildcards in phrase queries. Without changing how you are indexing (see @ameertawfik's answer), the standard query parser doesn't provide a good way to do this. You can, however, use the surround query parser to search using spans. This query would then look like:
1N(first, iss*)
Keep in mind that the surround query parser does not analyze, so 1N(first, iss*) and 1N(First, iss*) will not find the same results.
You could also construct this query using Lucene's SpanQueries directly, of course, like:
// Exact term "first", immediately followed (slop 0, in order) by a term starting with "iss".
SpanQuery[] queries = new SpanQuery[2];
queries[0] = new SpanTermQuery(new Term("issuer", "first"));
queries[1] = new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("issuer", "iss")));
Query finalQuery = new SpanNearQuery(queries, 0, true);

How to force solr QParserPlugin not to use whitespace tokenizer for Keyword fields?

I have a keyword field in my Solr schema.
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.SimpleKeywordTokenizerFactory"/>
</analyzer>
</fieldType>
When I search this field with the default Solr query parser or a dismax query,
category:(Mouse Pad) creates the query (category:Mouse) AND (category:Pad).
Is there a way not to split terms on whitespace when the field is a keyword field?
Added:
I need the SimpleKeywordTokenizerFactory analysis (which lowercases without splitting on whitespace) applied at query time, so the raw and term query parsers don't work for me.
You want to enter this query:
category:"Mouse Pad"
The query syntax already provides a way to do this. Quotes are for phrases. Parentheses mean something different. You can write your own query parser if you want, but I don't recommend it.
You could use TermQParserPlugin:
{!term f=category}Mouse Pad
Beware that no analysis is performed, so this will only work if the internal representation of your field is "Mouse Pad" (with title case).
Edit (2012-04-17):
If you still want analysis to be performed, all you need to do is to escape the space by prepending a backslash:
{!lucene}category:Mouse\ Pad
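If the value comes from application code, the backslash can be added when the query string is built. The helper below is a minimal sketch under the assumption that spaces are the only characters needing escaping in the value; other characters that are special in the Lucene query syntax are not handled, and the function name `escaped_field_query` is hypothetical.

```python
def escaped_field_query(field, value):
    # Prepend a backslash to each space so the parser keeps the value together
    # and passes it to the field's analyzer as one piece.
    escaped = value.replace(" ", "\\ ")
    return "{!lucene}" + field + ":" + escaped


query = escaped_field_query("category", "Mouse Pad")
```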

Resources